Fine-Tuning in Practice — Engineermaxxing

Chapter 0: Prompting Has Limits

You're building a customer support bot for a legal firm. The bot needs to respond in a very specific format: numbered action items, formal British English, references to specific internal case numbers, and a particular disclaimer at the bottom of every response.

You write a detailed system prompt. You add five examples. You tweak the temperature. The model still occasionally drops the disclaimer. Sometimes it uses casual language. On complex queries, it forgets the case number format. You spend a week on prompt engineering. The failure rate: 8%.

Eight percent might sound fine until you realize this is legal correspondence. One misfired response and a client complains. The problem is that you're fighting the model's prior — it saw billions of tokens of casual English, and your prompt is a thin layer of instruction on top of that enormous prior.

The core insight: Prompting changes what the model sees. Fine-tuning changes what the model is. When your task requires consistent behavior that a prompt can't reliably enforce — specific format, domain vocabulary, tone, safety rules — you need to change the weights, not the input.

What Prompting Can and Can't Do

Prompting works by steering the model's next-token predictions toward a region of output space. This works well when the region is large and forgiving — "be helpful," "be concise" — and fails when the region is narrow and precise — "always output valid JSON," "never use the word 'however'," "every response ends with a specific three-line footer."

The model's weights encode billions of text patterns. Your system prompt is competing with all of them simultaneously. Fine-tuning, by contrast, actually shifts the weight distributions so that your desired behavior becomes the model's default — its new prior.

[Interactive figure: Prompt Compliance vs Fine-Tuned Compliance. A slider sets the number of fine-tune examples. Compliance from prompting plateaus, while fine-tuned compliance keeps improving with more examples.]

Three Signals That You Need Fine-Tuning

Signal 1 — Format brittleness: Your prompt says "respond in JSON" and the model outputs markdown with JSON inside it, or adds prose before the JSON block, or uses single quotes instead of double. The model learned JSON from millions of examples where JSON appeared in many contexts. Fine-tuning can make strict JSON output the model's reflex.

Signal 2 — Domain vocabulary: Your task involves proprietary terminology, internal acronyms, or domain-specific meanings that differ from general usage. A legal model needs to know that "discovery" means the pretrial evidence exchange process, not finding something. Fine-tuning on domain text changes how the model represents these tokens.

Signal 3 — Consistent persona or tone: Your prompt says "you are Alex, a friendly but professional assistant." Two hundred tokens into a long conversation, Alex starts sounding like a generic chatbot. Fine-tuning bakes the persona into the weights.
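Signal 1 is easy to measure before you decide anything. The sketch below is one way to do it, assuming "strict JSON" means the whole output parses as a JSON object or array; the helper name `is_strict_json` and the sample outputs are made up for illustration.

```python
import json

FENCE = "`" * 3  # a markdown code fence, built this way so the sample stays readable

def is_strict_json(text: str) -> bool:
    """True only if the entire output parses as a JSON object or array."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, (dict, list))

# Hypothetical model outputs showing the failure modes described above
outputs = [
    '{"status": "ok"}',                            # strict JSON
    f'{FENCE}json\n{{"status": "ok"}}\n{FENCE}',   # JSON wrapped in markdown
    'Here is the JSON: {"status": "ok"}',          # prose before the JSON
    "{'status': 'ok'}",                            # single quotes
]

compliance = sum(is_strict_json(o) for o in outputs) / len(outputs)
print(f"strict JSON compliance: {compliance:.0%}")  # strict JSON compliance: 25%
```

Run the same check over a few hundred real responses and you have the compliance number the rest of this lesson keeps referring to.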

Why does prompting fail at enforcing strict output formats even with detailed instructions?

The model doesn't read system prompts
The model's prior from pre-training competes with prompt instructions: the weights encode billions of non-format-compliant outputs
Prompts are limited to 100 tokens

Chapter 1: When to Fine-Tune vs When to Prompt

Fine-tuning is not always the answer. It costs time, money, and introduces operational complexity. The question is never "can I fine-tune?" — you almost always can. The question is "should I?"

Here's how to think about it: prompting is a rental. You pay per token, the landlord (the API provider) can change the model tomorrow, and you can move out anytime. Fine-tuning is buying a house. Higher upfront cost, but the behavior is locked in. You own it. And if you're using it constantly, it's cheaper per use.

The decision framework in one sentence: If you're spending more on long system prompts than you'd spend training a fine-tuned model — or if your prompt compliance rate is below 95% for a critical behavior — fine-tune.

The Decision Tree

Step 1

Does prompting + few-shot achieve >95% compliance on your task?

↓ No

Step 2

Do you have (or can you collect) 50+ high-quality labeled examples?

↓ Yes

Step 3

Is the task format/style/persona stable? (Not changing weekly)

↓ Yes

Decision

Fine-tune. The investment pays off.
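The tree above can be sketched as a small function. The tree only spells out the path that ends in "fine-tune", so the recommendations returned on the other branches are assumptions added for illustration, as is the function name `recommend`.

```python
def recommend(compliance: float, n_examples: int, task_is_stable: bool) -> str:
    """Walk the three-step decision tree and return a recommendation."""
    # Step 1: prompting + few-shot already clears the 95% bar
    if compliance > 0.95:
        return "keep prompting"
    # Step 2: fewer than 50 high-quality labeled examples (assumed fallback)
    if n_examples < 50:
        return "collect more examples first"
    # Step 3: the format/style/persona is still changing (assumed fallback)
    if not task_is_stable:
        return "keep prompting until the task stabilises"
    return "fine-tune"

print(recommend(compliance=0.92, n_examples=200, task_is_stable=True))  # fine-tune
```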

Real Examples: Prompt vs Fine-Tune

Task	Recommendation	Why
Answer general questions	Prompt	General capability, model already good at it
Classify support tickets into 5 categories	Few-shot or fine-tune	With 200+ examples, fine-tuning cheaper at scale
Extract structured data from medical records	Fine-tune	Format precision + domain vocab critical
Write blog posts in your brand voice	Fine-tune	Consistent persona that prompts can't hold
One-off translation task	Prompt	Not worth the overhead
Detect specific safety violations in user content	Fine-tune	High precision required, runs millions of times/day
Summarize documents you wrote	RAG + prompt	Retrieval beats memorization for facts
Answer questions about your codebase	RAG + prompt	Code changes too fast; don't fine-tune moving targets

The RAG escape hatch: If the main reason you're considering fine-tuning is to inject knowledge (facts, docs, prices, policies that change), consider Retrieval-Augmented Generation (RAG) instead. Fine-tuning teaches behavior; RAG supplies facts. They're complementary, not competing.

[Interactive figure: Prompt vs Fine-Tune Cost Crossover. Inputs: daily requests (default 10,000) and system prompt tokens (default 500). Shows the crossover point where fine-tuning becomes cheaper than verbose prompting.]

Your knowledge base updates daily. You want the LLM to answer questions about it accurately. Should you fine-tune?

Yes: fine-tune daily to keep it current
Yes: fine-tuning gives better factual accuracy than prompting
No: use RAG instead. Fine-tuning injects behavior, not facts. Daily retraining is impractical and expensive.

Chapter 2: Training Data Preparation

The most common mistake in fine-tuning projects is spending two weeks on model selection and two hours on data. The priorities should be reversed: the model is mostly decided for you (use the strongest model you can afford), but the data is where all your leverage lives.

Garbage in, garbage out — but it's worse than that. With fine-tuning, inconsistent data is actively poisonous. The model will happily learn contradictions, and the result will be a model that's confidently inconsistent.

The Format: JSONL Conversations

Most fine-tuning APIs (OpenAI, Together, Fireworks) expect data in JSONL format — one JSON object per line, each representing a complete conversation. The conversation follows the familiar messages array structure: system, user, assistant turns.

jsonl — one line per training example (wrapped here for readability; in the real file each object must sit on a single line)
{"messages": [
  {"role": "system", "content": "You are a legal document assistant. Always respond with numbered action items in formal British English."},
  {"role": "user", "content": "Our client received a demand letter. What should we do?"},
  {"role": "assistant", "content": "1. Acknowledge receipt of the demand letter within 5 business days.\n2. Instruct the client to refrain from any communications with the opposing party.\n3. Obtain and review all relevant documentation referenced in the letter.\n4. Schedule a consultation to assess merit and exposure.\n\n*This response does not constitute legal advice.*"}
]}
{"messages": [
  {"role": "system", "content": "You are a legal document assistant. Always respond with numbered action items in formal British English."},
  {"role": "user", "content": "What is the statute of limitations for breach of contract?"},
  {"role": "assistant", "content": "1. Determine the jurisdiction in which the contract was formed and is to be performed.\n2. Identify whether the contract is written or oral, as limitation periods differ.\n3. Note that most jurisdictions impose a 3–6 year limitation for written contracts.\n4. Consult local statutes for any exceptions or tolling provisions.\n\n*This response does not constitute legal advice.*"}
]}
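If you build the file programmatically, `json.dumps` gives you the one-object-per-line guarantee for free, because newlines inside strings are escaped. A minimal sketch, assuming the three-turn system/user/assistant layout shown above; the helper names are illustrative.

```python
import json

def to_jsonl_line(system: str, user: str, assistant: str) -> str:
    """Serialise one conversation as a single JSONL line."""
    record = {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}
    # json.dumps escapes "\n" inside strings, so the result has no raw line breaks
    return json.dumps(record, ensure_ascii=False)

def validate_line(line: str) -> bool:
    """Check that a line parses, uses known roles, and ends with an assistant turn."""
    record = json.loads(line)
    roles = [m["role"] for m in record["messages"]]
    return roles[-1] == "assistant" and set(roles) <= {"system", "user", "assistant"}

line = to_jsonl_line(
    "You are a legal document assistant.",
    "Our client received a demand letter. What should we do?",
    "1. Acknowledge receipt.\n\n*This response does not constitute legal advice.*",
)
print(validate_line(line))
```

Writing the file is then a matter of writing each line followed by a single newline.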

Quality Over Quantity: How Many Examples Do You Need?

The prevailing myth is that fine-tuning requires thousands of examples. This was true for training from scratch. For fine-tuning a pre-trained model, the math is completely different.

The model already knows how to write. It knows grammar, reasoning, world knowledge. You're not teaching it cognition — you're teaching it your specific format and style. That's a much smaller delta. Empirically:

Task Type	Examples Typically Needed	Why
Format / style consistency	50–200	The model knows the content, learns the wrapper
Classification (few categories)	100–500	Needs decision boundary examples across classes
Domain vocabulary / jargon	200–1000	Needs coverage of the new vocabulary in context
Complex reasoning in new domain	500–5000	Needs enough examples to learn domain logic
Full capability shift	10,000+	Teaching a genuinely new skill, not just style

The 50-example test: Before collecting 500 examples, collect 50 really high-quality ones and run a fine-tuning job. Evaluate the result honestly. If it's 80% there, add another 50 targeted examples covering failure cases. Iterate. You'll often reach your goal with 150 examples and save yourself weeks of data collection.

Data Cleaning Checklist

Deduplication: Remove near-duplicate examples. Exact duplicates waste training steps; near-duplicates (same question, slightly different phrasing) can cause the model to memorize rather than generalize. Use fuzzy matching (Jaccard similarity or embedding similarity) to find them.

Balance: If you're teaching the model five output categories, make sure each category is roughly equally represented. A 90/10 split will produce a model that almost always predicts the majority class.

Consistency: Every example of the same scenario type should produce the same output format. If three examples end with the disclaimer and two don't, the model will learn to include it 60% of the time — worse than no fine-tuning at all.

Length distribution: Include examples of short, medium, and long responses if you want the model to handle all lengths. A dataset of only short examples produces a model that truncates long responses.

python — data cleaning script
import json
from collections import Counter
import hashlib

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def dedup(examples):
    seen = set()
    clean = []
    for ex in examples:
        # Hash every turn of the conversation (system, user, assistant) together
        key = "".join(m["content"] for m in ex["messages"])
        h = hashlib.md5(key.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            clean.append(ex)
    return clean

def check_format(examples):
    # Verify every example has the required disclaimer
    bad = []
    for i, ex in enumerate(examples):
        last = ex["messages"][-1]["content"]
        if "does not constitute legal advice" not in last:
            bad.append(i)
    return bad

data = load_jsonl("training.jsonl")
data = dedup(data)
bad_idx = check_format(data)
print(f"{len(bad_idx)} examples missing disclaimer: fix before training")
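The script above only catches exact duplicates. For near-duplicates, a word-level Jaccard pass is one simple option. This is a sketch, not a tuned deduplicator: the 0.8 threshold is an assumption, the comparison is O(n²), and embedding similarity may work better on paraphrases.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def near_duplicates(questions, threshold=0.8):
    """Return index pairs (i, j) whose similarity meets the threshold."""
    pairs = []
    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            if jaccard(questions[i], questions[j]) >= threshold:
                pairs.append((i, j))
    return pairs

questions = [
    "What is the statute of limitations for breach of contract?",
    "What is the statute of limitations for a breach of contract?",
    "Our client received a demand letter. What should we do?",
]
print(near_duplicates(questions))  # [(0, 1)]
```

Review flagged pairs by hand before deleting anything; two similar questions with different correct answers are exactly the examples you want to keep.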

[Interactive figure: Data Balance Visualizer. Sliders set the proportions of classes A to D (default 100 each) and show how imbalance affects model confidence on the minority class. A balanced dataset gives the best per-class accuracy.]

You have 500 customer support tickets. 450 are "billing issues," 50 are "technical problems." You fine-tune a classifier. What's the most likely failure mode?

The model won't learn anything because 500 examples is too few
The model will classify most "technical problems" as "billing issues" because the training data is 90% billing
The JSONL format will cause a training error
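You can catch this kind of imbalance before training with a few lines of counting. A minimal sketch: the 20% minimum share is an assumed threshold, not a rule, and the helper names are illustrative.

```python
from collections import Counter

def class_shares(labels):
    """Fraction of the dataset that each label accounts for."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def underrepresented(labels, min_share=0.2):
    """Labels whose share falls below min_share (an assumed threshold)."""
    return sorted(l for l, share in class_shares(labels).items() if share < min_share)

# The 450 / 50 split from the question above
labels = ["billing"] * 450 + ["technical"] * 50
print(class_shares(labels))      # {'billing': 0.9, 'technical': 0.1}
print(underrepresented(labels))  # ['technical']
```

Typical fixes are collecting more minority examples or downsampling the majority class until the shares are roughly even.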

Chapter 3: LoRA & QLoRA — Fine-Tuning Without Breaking the Bank

A 7-billion-parameter model has 7 billion floating-point numbers. Full fine-tuning means computing gradients for all 7 billion, storing optimizer state (Adam keeps two extra values per parameter), and keeping all of that in GPU memory simultaneously. For a 7B model, that's roughly 84 GB of VRAM before counting activations, more than three 24 GB consumer GPUs combined.
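Here is where a figure like 84 GB can come from. The byte counts are assumptions chosen to reproduce it: 16-bit weights and gradients (2 bytes each) plus two 32-bit Adam moments (8 bytes). Activations are left out, and other mixed-precision setups use more.

```python
def full_finetune_gb(n_params: float, weight_bytes: int = 2,
                     grad_bytes: int = 2, adam_bytes: int = 8) -> float:
    """Memory for weights + gradients + Adam state, in GB (1 GB = 1e9 bytes)."""
    return n_params * (weight_bytes + grad_bytes + adam_bytes) / 1e9

print(full_finetune_gb(7e9))  # 84.0
```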

Most teams can't afford that. But here's the key insight: you don't need to.

The core observation: When you fine-tune a model, the change to the weight matrices — the delta between pre-trained weights and fine-tuned weights — tends to be low-rank. You're not learning an arbitrary transformation; you're learning a specific style shift. That shift lives in a much lower-dimensional space than the full weight matrix.

The LoRA Math

A weight matrix W in a transformer has shape [d_model, d_model], e.g., [4096, 4096] for a 7B model. That's about 16.8 million parameters per matrix. Low-Rank Adaptation (LoRA) freezes W and learns a small delta instead:

ΔW = A × B where A ∈ ℝ^{d × r}, B ∈ ℝ^{r × d}

Here r is the rank, typically 4, 8, or 16. Instead of learning 16.8M parameters (4096×4096 = 16,777,216), you learn 2×4096×8 = 65,536 parameters at r = 8. That's a 256× reduction in trainable parameters.

At inference, the delta is simply added to the frozen weight: W' = W + α/r × A×B, where α is a scaling factor (often equal to r, so α/r = 1). The full model runs at normal speed — no overhead at inference time once you merge the adapter.
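The two claims above, the parameter count and the "no overhead once merged" property, can be checked on a toy example. This sketch uses plain Python lists and tiny matrices; the row-vector convention (y = xW) and the sizes d = 6, r = 2 are assumptions for illustration.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_param_count(d: int, r: int) -> int:
    """Trainable parameters in A (d x r) plus B (r x d)."""
    return 2 * d * r

d, r, alpha = 6, 2, 2
rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]  # frozen weight, d x d
A = [[rng.uniform(-1, 1) for _ in range(r)] for _ in range(d)]  # adapter, d x r
B = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(r)]  # adapter, r x d
x = [[rng.uniform(-1, 1) for _ in range(d)]]                    # one input row, 1 x d

scale = alpha / r
AB = matmul(A, B)
W_merged = [[W[i][j] + scale * AB[i][j] for j in range(d)] for i in range(d)]

# Unmerged path: frozen weight plus the scaled low-rank branch
xW = matmul(x, W)
xAB = matmul(matmul(x, A), B)
y_adapter = [xW[0][j] + scale * xAB[0][j] for j in range(d)]

# Merged path: a single matrix multiply with W'
y_merged = matmul(x, W_merged)[0]

max_diff = max(abs(a - b) for a, b in zip(y_adapter, y_merged))
print(max_diff < 1e-9)                             # True
print(lora_param_count(4096, 8))                   # 65536
print(4096 * 4096 // lora_param_count(4096, 8))    # 256
```

Both paths give the same output up to floating-point rounding, which is why a merged adapter costs nothing extra at inference.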

[Interactive figure: LoRA Low-Rank Decomposition. A full weight update ΔW has millions of parameters; LoRA approximates it as A×B. A rank slider (default r = 8) shows how approximation quality trades off against parameter count.]

Which Matrices to Adapt?

Each transformer block has four attention weight matrices: Q (query), K (key), V (value), and O (output projection), plus the MLP projections (up and down, and in Llama-style models a third gate projection). LoRA originally applied to Q and V only. In practice, adapting all of them often gives better results for the same rank.

python — LoRA with Hugging Face PEFT
import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,               # rank — higher = more capacity, more params
    lora_alpha=32,       # scaling: the adapter update is multiplied by alpha/r (here 2)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.52%

QLoRA: Quantize the Base, Train the Adapter

QLoRA (Dettmers et al., 2023) goes further: it quantizes the frozen base model weights to a 4-bit data type, NormalFloat4 (reducing 7B × 16 bits = 14 GB to 7B × 4 bits = 3.5 GB), then trains LoRA adapters in bfloat16. The result: you can fine-tune a 7B model on a single 24 GB GPU (RTX 3090 or 4090).

The quantization introduces some noise, but the LoRA training compensates for much of it. The QLoRA paper reports quality close to 16-bit fine-tuning on its benchmarks. The main saving is GPU memory rather than compute, so measure the quality gap on your own task.

python — QLoRA setup with bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",         # NormalFloat4 — best for normal distributions
    bnb_4bit_compute_dtype=torch.bfloat16, # compute in bf16, store in 4-bit
    bnb_4bit_use_double_quant=True,     # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
# Now add LoRA adapters as before
# GPU memory: ~6 GB instead of ~16 GB

Choosing Rank r

Rank selection is the main hyperparameter in LoRA. The rule of thumb: start with r=16. If your validation loss plateaus early and you have enough data, try r=32 or r=64. If you're memory-constrained or have very few examples, r=4 or r=8 works surprisingly well for style and format tasks.

Rank	Trainable params (Llama 3.1 8B, all linear layers)	Best for
r=4	~10M (0.13%)	Style/format, very few examples
r=8	~21M (0.26%)	Classification, tone, light adaptation
r=16	~42M (0.52%)	Most tasks — good default
r=64	~168M (2.0%)	Complex reasoning, heavy domain shift
Full fine-tune	8B (100%)	When nothing else works and you have a cluster
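The parameter counts in the table can be reproduced by hand: each adapted linear layer adds r × (in_features + out_features) parameters. The sketch below assumes the Llama 3.1 8B layer shapes from its public config (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers) and adapters on all seven linear projections.

```python
# Assumed Llama 3.1 8B shapes: (in_features, out_features) per linear layer
HIDDEN, KV_DIM, MLP_DIM, N_LAYERS = 4096, 1024, 14336, 32
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_trainable_params(r: int, shapes=SHAPES, n_layers: int = N_LAYERS) -> int:
    """Each adapted layer adds an (in x r) and an (r x out) matrix."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers

for r in (4, 8, 16, 64):
    print(f"r={r}: {lora_trainable_params(r):,}")
# r=4: 10,485,760
# r=8: 20,971,520
# r=16: 41,943,040
# r=64: 167,772,160
```

The count scales linearly with r, so doubling the rank doubles adapter size and adapter memory.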

LoRA trains adapter matrices A and B instead of updating W directly. What property of fine-tuning makes this work?

Pre-trained models are wrong, so any adapter will improve them
LoRA only works on small models under 1B parameters
The weight changes during fine-tuning are low-rank: the actual behavioral delta lives in a small subspace, so A×B captures it faithfully

Chapter 4: The Fine-Tuning Pipeline

Three main paths to fine-tuning, ordered by ease vs control:

OpenAI API

Upload JSONL → call API → done. Easiest. No GPU needed. Limited model choice.

↓ more control

Hugging Face + PEFT

Full control. Any open model. Run on your hardware or cloud. More setup.

↓ even more control

Axolotl

YAML config-driven. Handles multi-GPU, packing, flash attention automatically.

Path 1: OpenAI Fine-Tuning API

The simplest path. You upload a JSONL file, specify hyperparameters (or let the API choose), and wait. The API runs the training job on OpenAI's infrastructure. You pay per training token, then per inference token on the resulting model.

python — OpenAI fine-tuning API end-to-end
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from environment

# Step 1: Upload training data
with open("training.jsonl", "rb") as f:
    response = client.files.create(file=f, purpose="fine-tune")
file_id = response.id
print(f"Uploaded file: {file_id}")

# Step 2: Start fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file_id,
    model="gpt-4o-mini-2024-07-18",  # cheapest capable model
    hyperparameters={
        "n_epochs": 3,               # 3 passes through your data
        "batch_size": "auto",         # let API decide
        "learning_rate_multiplier": "auto",
    },
)
print(f"Job started: {job.id}")

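# Step 3: Poll until the job reaches a terminal state
import time
while True:
    status = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {status.status}")
    if status.status in ["succeeded", "failed", "cancelled"]:
        break
    time.sleep(30)

# Step 4: Use the fine-tuned model (fine_tuned_model is only set on success)
if status.status != "succeeded":
    raise RuntimeError(f"Fine-tuning ended with status: {status.status}")
ft_model = status.fine_tuned_model  # e.g., "ft:gpt-4o-mini:org:name:id"
response = client.chat.completions.create(
    model=ft_model,
    messages=[
        # Reuse the training system prompt so inference matches the training data
        {"role": "system", "content": "You are a legal document assistant. Always respond with numbered action items in formal British English."},
        {"role": "user", "content": "Our client got a demand letter."},
    ],
)
print(response.choices[0].message.content)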

Path 2: Hugging Face + PEFT + Trainer

For open-weight models (Llama, Mistral, Qwen), you run training yourself. The Hugging Face ecosystem provides the building blocks: transformers for the model, peft for LoRA, trl for the SFT Trainer (Supervised Fine-Tuning).

python — HuggingFace SFT with LoRA/QLoRA
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

dataset = load_dataset("json", data_files="training.jsonl", split="train")

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./checkpoints",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,   # effective batch = 16
        learning_rate=2e-4,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
        save_steps=100,
        logging_steps=10,
        bf16=True,
        max_seq_length=2048,             # newer TRL releases name this max_length
    ),
    train_dataset=dataset,
    peft_config=lora_config,
)
trainer.train()
trainer.save_model("./final-adapter")

Path 3: Axolotl

Axolotl wraps the Hugging Face ecosystem in a YAML config file. Instead of writing 100 lines of Python, you write a 40-line config and run one command. It handles gradient checkpointing, flash attention, data packing, multi-GPU setups, and W&B logging automatically.

yaml — axolotl config (config.yaml)
base_model: meta-llama/Llama-3.1-8B-Instruct
model_type: AutoModelForCausalLM

load_in_4bit: true           # QLoRA
adapter: lora
lora_r: 16
lora_alpha: 32
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj

datasets:
  - path: training.jsonl
    type: chat_template

num_epochs: 3
micro_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 0.0002
lr_scheduler: cosine
warmup_steps: 10

output_dir: ./output
save_safetensors: true

bash — run axolotl
pip install axolotl
accelerate launch -m axolotl.cli.train config.yaml

[Interactive figure: Training Loss Curve. A healthy run shows training loss falling smoothly while validation loss tracks it closely; divergence between the two signals overfitting. An epochs control (default 3) simulates runs with different data sizes.]

In the SFT Trainer config, gradient_accumulation_steps=8 with per_device_train_batch_size=2 gives an effective batch size of 16. Why use gradient accumulation instead of just setting batch_size=16?

Gradient accumulation trains faster than large batches
Large batches require more GPU memory to hold activations. Accumulation simulates a large batch by doing 8 small forward/backward passes before one optimizer step: same math, much less peak memory
Batch size is limited to 2 by the PEFT library
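The "same math" claim can be checked on a toy problem. This sketch uses a one-parameter linear model with a mean-squared-error loss, both chosen for illustration. The equality holds because every micro-batch has the same size and the loss is a mean; it is only approximate for layers that depend on batch statistics.

```python
def grad_mse(w: float, batch) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# 16 toy (x, y) pairs and an arbitrary current weight
data = [(float(i), 3.0 * i + 1.0) for i in range(16)]
w = 0.5

# One big batch of 16
full_grad = grad_mse(w, data)

# Eight micro-batches of 2, each contribution scaled by 1/8 before accumulating
accum_steps, micro_size = 8, 2
accumulated = 0.0
for k in range(accum_steps):
    micro_batch = data[k * micro_size:(k + 1) * micro_size]
    accumulated += grad_mse(w, micro_batch) / accum_steps

print(abs(full_grad - accumulated) < 1e-9)  # True
```

Only one micro-batch of activations is alive at a time, which is where the memory saving comes from.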

Chapter 5: Evaluation — How to Know It Worked

You ran the training job. Loss went down. Now what? Many teams ship the model and discover problems in production. Don't do that. Evaluation is where you find out if you actually got what you wanted — before your users do.

There are three things you need to check: (1) did it learn the target behavior, (2) did it not break anything else, and (3) is it actually better than what you had before?

The Hold-Out Test Set

Before you start training, split your labeled data into training, validation, and test sets. The test set is sacred: it never touches the training job. A typical split: 80% train, 10% validation (used during training to check for overfitting), 10% test (used only at the end).

The test set should be representative of your real production distribution. If you fine-tuned for legal correspondence, your test set should contain the same variety of legal questions and request types you'll see in production — not just the easy ones.
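A reproducible 80/10/10 split takes only a few lines. This is a minimal sketch with a fixed seed so the test set stays the same across runs; the function name and the seed are illustrative. If your classes are imbalanced, a stratified split is a better choice.

```python
import random

def split_dataset(examples, seed: int = 42, train: float = 0.8, val: float = 0.1):
    """Shuffle once with a fixed seed, then cut into train / validation / test."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(200))
print(len(train_set), len(val_set), len(test_set))  # 160 20 20
```

Save the three files once and reuse them; re-splitting after every data change leaks old test examples into training.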

Task-Specific Metrics

Generic metrics like perplexity don't tell you if the model is doing your specific task correctly. Define task-specific metrics that match what you actually care about:

Task	Metric	How to Compute
Format compliance	% valid outputs	Parse output with regex or JSON parser; count successes
Classification	F1 per class	Compare predicted label to ground truth
Extraction (structured data)	Field-level accuracy	Parse JSON output; check each field independently
Summarization	ROUGE-L + human eval	Automated n-gram overlap + sample review
Tone/style	LLM-as-judge	GPT-4 evaluates each output on a rubric
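For the running legal example, the format-compliance row might be computed like this. It is a sketch under stated assumptions: "compliant" here means consecutively numbered items starting at 1 and the disclaimer as the final line. The checker name and sample outputs are made up for illustration.

```python
import re

DISCLAIMER = "This response does not constitute legal advice."

def is_compliant(text: str) -> bool:
    """Numbered action items (1., 2., ...) followed by the closing disclaimer."""
    body = text.strip()
    # Allow the markdown italics markers around the disclaimer
    if not body.rstrip("*").endswith(DISCLAIMER):
        return False
    numbers = [int(n) for n in re.findall(r"^(\d+)\.\s", body, flags=re.MULTILINE)]
    return len(numbers) > 0 and numbers == list(range(1, len(numbers) + 1))

# Hypothetical model outputs
outputs = [
    "1. Acknowledge receipt.\n2. Review the documents.\n\n*" + DISCLAIMER + "*",
    "Sure! You should acknowledge receipt and review the documents.",
    "1. Acknowledge receipt.\n3. Review the documents.\n\n*" + DISCLAIMER + "*",
]
rate = sum(is_compliant(o) for o in outputs) / len(outputs)
print(f"format compliance: {rate:.0%}")  # format compliance: 33%
```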

LLM-as-Judge

For subjective qualities (tone, professionalism, accuracy), human evaluation is the gold standard but expensive. LLM-as-judge uses a powerful model (GPT-4o, Claude 3.5 Sonnet) to evaluate your fine-tuned model's outputs against a rubric. It's not perfect — LLMs have biases toward verbose outputs and their own style — but it scales.

python — LLM-as-judge evaluation
import json
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from environment

def evaluate_response(client, question, response, rubric):
    judge_prompt = f"""Evaluate this assistant response on a 1-5 scale.

Question: {question}
Response: {response}

Rubric:
{rubric}

Output ONLY a JSON object: {{"score": <1-5>, "reason": "<one sentence>"}}"""

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"}
    )
    return json.loads(result.choices[0].message.content)

rubric = """1: Completely wrong format or missing disclaimer
2: Right format but informal language
3: Formal British English, numbered, but missing one required element
4: All elements present, professional tone
5: Perfect — exactly matches the target style with all requirements"""

# test_pairs: list of (question, model_response) tuples built from your held-out test set
scores = [evaluate_response(client, q, r, rubric) for q, r in test_pairs]
avg = sum(s["score"] for s in scores) / len(scores)
print(f"Average judge score: {avg:.2f}/5.0")

Regression Testing — Don't Break General Capabilities

Fine-tuning can cause catastrophic forgetting: the model improves on your task but loses competence elsewhere. A model fine-tuned on legal correspondence might start responding to casual "tell me a joke" requests with formal numbered action items.

Run your fine-tuned model on a set of general capability prompts (math, coding, common sense) and compare to the base model. If general scores drop more than 5-10%, you've over-trained or used inconsistent data.
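A regression check can be as small as a dictionary comparison. In this sketch the scores are hypothetical numbers, not measurements, and the 10% relative-drop threshold is the upper end of the range mentioned above.

```python
def regressions(base_scores, tuned_scores, max_relative_drop: float = 0.10):
    """Return tasks where the fine-tuned score fell by more than the allowed share."""
    flagged = {}
    for task, base in base_scores.items():
        drop = (base - tuned_scores[task]) / base
        if drop > max_relative_drop:
            flagged[task] = round(drop, 3)
    return flagged

# Hypothetical accuracy on general-capability prompt sets
base = {"math": 0.62, "coding": 0.55, "common_sense": 0.80}
tuned = {"math": 0.60, "coding": 0.44, "common_sense": 0.79}
print(regressions(base, tuned))  # {'coding': 0.2}
```

Wire this into the same script that computes your task metrics so a regression blocks deployment instead of surfacing in production.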

[Interactive figure: A/B Evaluation, Base vs Fine-Tuned. Side-by-side outputs from the base and fine-tuned models for a chosen test prompt, highlighting differences in format compliance and tone.]

After fine-tuning, your model scores 95% on your task but now answers casual questions with formal numbered action items. What happened?

The model was fine-tuned correctly: formal responses are better
The test set was too small to detect this problem
Catastrophic forgetting: the model over-specialized and lost context-appropriate general behavior. Fix by including diverse examples in training data or reducing epochs.

Chapter 6: Deployment — Serving Fine-Tuned Models

Your model is trained and evaluated. Now it needs to serve real traffic. Deploying a fine-tuned model has different considerations from deploying a stock model — especially if you used LoRA adapters.

LoRA Merging vs Runtime Adapters

After training a LoRA adapter, you have two choices for how to serve it.

Option 1 — Merge and export: Permanently bake the adapter into the base model weights (W' = W + (α/r) × A×B), then serve the merged model. This is the simplest deployment: the resulting model is identical in shape to the base model, runs at full speed, and any serving infrastructure that handles the base model handles your fine-tuned model.

Option 2 — Runtime adapters: Keep the adapter separate from the base model. Load the adapter on top of the base model at inference time. This enables serving multiple adapters on a single base model — a significant cost saving when you have ten fine-tuned variants for ten different customers.

python — merging LoRA adapter into base model
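import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load in bf16 so the merged checkpoint is not saved in fp32 (twice the size)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)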
model = PeftModel.from_pretrained(base, "./final-adapter")

# Merge adapter weights into base — creates a standard model
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").save_pretrained("./merged-model")

# Now serve ./merged-model with vLLM, Ollama, or any standard serving stack

Multi-LoRA Serving

Suppose you have 20 customers, each with a fine-tuned model. Naively that's 20 separate model deployments — 20× the GPU cost. With multi-LoRA serving, you load one base model and swap adapters per request. The adapter is small (tens of MB), so switching is fast. Tools like vLLM, LoRAX, and S-LoRA support this.

bash — serving multiple LoRA adapters with vLLM
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules \
    legal-firm-a=./adapters/firm-a \
    legal-firm-b=./adapters/firm-b \
    customer-support=./adapters/support \
  --max-lora-rank 64

python — routing requests to correct adapter
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Each tenant gets their adapter by name
response = client.chat.completions.create(
    model="legal-firm-a",  # routes to that LoRA adapter
    messages=[{"role": "user", "content": "Explain discovery obligations."}]
)

Versioning and Rollback

Fine-tuned models need versioning just like software. When you retrain on new data, the old version should remain accessible for at least a week — both for rollback in case of regressions and for A/B testing the new version against production traffic.

A simple convention: name adapters with a data version and a date stamp, for example legal-v2-20260501. Keep at least two versions deployed simultaneously during the transition window.
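The naming convention is easy to automate. A small sketch: the helper names are illustrative, and it assumes the date stamp is always the last hyphen-separated field.

```python
from datetime import date

def adapter_name(task: str, data_version: int, trained_on: date) -> str:
    """Build a name like legal-v2-20260501."""
    return f"{task}-v{data_version}-{trained_on:%Y%m%d}"

def versions_to_keep(names, keep: int = 2):
    """Newest adapters first, sorted by the trailing YYYYMMDD stamp."""
    return sorted(names, key=lambda n: n.rsplit("-", 1)[1], reverse=True)[:keep]

print(adapter_name("legal", 2, date(2026, 5, 1)))  # legal-v2-20260501
print(versions_to_keep([
    "legal-v1-20260301", "legal-v2-20260501", "legal-v3-20260601",
]))  # ['legal-v3-20260601', 'legal-v2-20260501']
```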

[Interactive figure: Multi-LoRA Serving Architecture. Shows how one base model serves multiple tenants by routing each request through that tenant's adapter.]

You serve 15 fine-tuned variants of Llama 3.1 8B. What's the GPU memory advantage of multi-LoRA serving vs deploying each as a separate merged model?

No advantage: adapters are just as large as the full model
Multi-LoRA loads one copy of the 8B base model (16 GB) plus small adapters (<100 MB each), vs 15 merged models at 16 GB each = 240 GB. Roughly a 14× memory reduction.
Multi-LoRA can only serve one adapter at a time
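The memory comparison in the question is simple arithmetic. The sketch below assumes 16 GB for the bf16 base model and 0.1 GB per adapter as an upper bound, and it ignores KV cache and activation memory, which both setups need.

```python
def merged_gb(n_variants: int, base_gb: float = 16.0) -> float:
    """Every variant carries its own full copy of the weights."""
    return n_variants * base_gb

def multi_lora_gb(n_variants: int, base_gb: float = 16.0, adapter_gb: float = 0.1) -> float:
    """One shared base model plus one small adapter per variant."""
    return base_gb + n_variants * adapter_gb

n = 15
print(merged_gb(n))                                  # 240.0
print(multi_lora_gb(n))                              # 17.5
print(round(merged_gb(n) / multi_lora_gb(n), 1))     # 13.7
```

Under these assumed sizes the ratio comes out a little under 15×, and it approaches the number of variants as adapters get smaller.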

Chapter 7: Cost Analysis — The Break-Even Calculation

Fine-tuning has a fixed upfront cost (data collection + training compute) and then a reduced inference cost (smaller model, shorter prompts). Prompting has zero upfront cost but a higher ongoing inference cost (long system prompts + few-shot examples at every request).

At some request volume, fine-tuning becomes the cheaper option. Finding that crossover point is the break-even analysis.

The Math

Define:

C_train = total training cost (one-time)
P_prompt = cost per request with prompting (system prompt + few-shot tokens)
P_ft = cost per request with fine-tuned model (shorter prompt, smaller/cheaper model)
N = number of requests before break-even

N = C_train ÷ (P_prompt − P_ft)

Example: You use GPT-4o with a 1,500-token system prompt plus 3 few-shot examples (800 tokens), for 2,300 tokens of overhead per request. At $5/M input tokens, that overhead costs $0.0115 per request. Suppose a fine-tuned GPT-4o-mini with a 50-token prompt costs $0.00015 per request for the same overhead. Delta = $0.01135 per request. With a training cost of about $50, break-even = 50 / 0.01135 ≈ 4,400 requests. This compares prompt overhead only; user-message and output tokens are billed in both setups, at each model's own rates.

At 1,000 requests/day, you break even in 4.4 days. After that, every request saves $0.01135.
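The same calculation as a reusable function, using the figures from the example above. The function name is illustrative, and the guard against a non-positive saving is an addition: if fine-tuning does not lower the per-request cost, there is no break-even point.

```python
def break_even_requests(c_train: float, p_prompt: float, p_ft: float) -> float:
    """N = C_train / (P_prompt - P_ft)."""
    saving = p_prompt - p_ft
    if saving <= 0:
        raise ValueError("no break-even: fine-tuning does not lower per-request cost")
    return c_train / saving

p_prompt = 2300 * 5 / 1_000_000   # 2,300 overhead tokens at $5 per million
n = break_even_requests(c_train=50, p_prompt=p_prompt, p_ft=0.00015)

print(round(n))             # 4405
print(round(n / 1000, 1))   # 4.4  (days at 1,000 requests/day)
```

Pass the full upfront cost, including data collection, as `c_train` to see how much later the crossover arrives.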

Realistic Cost Breakdown

Item	OpenAI API	Self-hosted (A100)
Training 7B model, 500 examples, 3 epochs	~$5-15	~$2-8 (cloud GPU)
Training 70B model, 1000 examples, 3 epochs	~$100-300	~$50-150 (8×A100)
Inference cost (per 1M tokens)	$0.15-$5 (gpt-4o-mini to gpt-4o)	$0.10-$0.40 (self-hosted)
Data collection (100 examples)	$0-2,000 (human annotation)	Same
Serving infrastructure	Included in API price	$500-3000/month (GPU rental)

The hidden cost: Data collection is often the largest expense. 500 high-quality labeled examples at $4 each (skilled annotator time) = $2,000. This dwarfs the $50 training cost from the example above. The break-even analysis must include data cost. With $2,050 total cost and $0.01135 savings/request, you need about 180,000 requests to break even, roughly 6 months at 1,000/day.

[Interactive calculator: Break-Even Calculator. Inputs: training cost (default $200), prompt token overhead (default 1,500), requests per day (default 5,000), base model price (default $5 per million tokens). Output: the request volume at which fine-tuning becomes cheaper than prompting.]

Your fine-tuning training cost is $500. Prompting costs $0.02/request; fine-tuned model costs $0.005/request. At what request volume do you break even?

25,000 requests
5,000 requests
500 / (0.02 - 0.005) = 500 / 0.015 ≈ 33,333 requests

Chapter 8: Fine-Tuning Decision Tool

Describe your use case and constraints. The tool will recommend whether to prompt, few-shot, fine-tune, or use RAG — with estimated costs and quality projections. Adjust sliders to see how tradeoffs shift.

[Interactive tool: Fine-Tuning Advisor. Inputs: task type; how often the knowledge changes; labeled examples available (default 200); daily requests (default 5,000); setup budget (default $500); latency tolerance (default 1,000 ms); compliance requirement (default 90%).]

Chapter 9: Connections — Where Fine-Tuning Leads

Supervised fine-tuning is one rung on a ladder. Once you understand it, three major directions open up — each building on the same foundation but going further.

Fine-Tuning → RLHF

Reinforcement Learning from Human Feedback starts where SFT ends. After SFT, you have a model that mimics your examples. But mimicry has a ceiling — the model will reproduce the average of your training data, including its mediocre examples.

RLHF adds a second phase: human raters compare pairs of outputs and indicate which they prefer. A reward model is trained on these preferences. Then the policy (your SFT model) is optimized with PPO to maximize the reward model's score. This is how ChatGPT, Claude, and Gemini go from "capable base model" to "genuinely helpful assistant."

Modern shortcut: Direct Preference Optimization (DPO) skips the separate reward model entirely, directly fine-tuning on preference pairs. It targets the same objective with a simpler pipeline, and DPO and its variants are now widely used for preference-based fine-tuning.
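To make "directly fine-tuning on preference pairs" concrete, here is the DPO loss for a single pair, written with plain floats. The inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model (usually the SFT model). The β value of 0.1 and the sample numbers are assumptions for illustration.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * margin) for one preference pair."""
    # How much more the policy favours each response than the reference does
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) rewritten as log(1 + exp(-z))
    return math.log1p(math.exp(-z))

# Policy identical to the reference: margin is 0, loss is log(2)
print(round(dpo_loss(-10.0, -10.0, -10.0, -10.0), 4))  # 0.6931
# Policy has shifted probability toward the chosen response: loss drops
print(dpo_loss(-8.0, -12.0, -10.0, -10.0) < math.log(2))  # True
```

Training libraries compute this over batches of token-level log-probabilities, but the quantity being minimised has this shape.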

Fine-Tuning → Knowledge Distillation

Distillation uses a large teacher model to generate training data for a smaller student model. You run GPT-4 on 10,000 prompts, collect its outputs, and fine-tune Llama 3.1 8B on those outputs. The student learns to mimic the teacher's behavior at a fraction of the inference cost.

This is called knowledge distillation via SFT. The key insight: for many tasks, a strong teacher's outputs are consistent and good enough to serve as training labels. You get teacher-quality training signal at scale without expensive human annotation. Microsoft's Phi series of small models relied heavily on synthetic data generated by larger models.

Fine-Tuning → Domain Adaptation

The techniques you've learned here apply beyond instruction following. Domain adaptation fine-tunes on domain-specific raw text (medical literature, legal documents, financial filings) to shift the model's knowledge distribution before task-specific SFT.

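The two-stage recipe: (1) continued pre-training on domain text, which teaches vocabulary and domain knowledge; (2) instruction fine-tuning on domain-specific Q&A pairs, which teaches task-specific behavior. Code Llama follows this pattern (Llama 2 further trained on code, then instruction-tuned). BioMedLM and BloombergGPT take a related route: they were pre-trained from scratch on domain-heavy corpora rather than adapted from an existing model.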

Technique	Builds on SFT by adding	When to use
RLHF / PPO	Human preference feedback + reward model + RL optimization	When quality ceiling from imitation is too low
DPO	Direct preference pairs, no reward model	When you want RLHF benefits without PPO complexity
Distillation	Teacher model generates labels	When you want small-model quality without human labels
Domain pre-training	Raw domain text before SFT	When domain vocabulary and knowledge are the bottleneck
Continual fine-tuning	Periodic re-training on new data	When task data grows over time

The meta-lesson: Fine-tuning is not a technique — it's a principle. You have a model with general knowledge. You have specific knowledge about your task. Fine-tuning is the process of transferring your specific knowledge into the model's weights. Every technique above — RLHF, DPO, distillation, domain adaptation — is a different answer to the question "what kind of knowledge, and from what source?"

"What I cannot create, I do not understand." — Richard Feynman

You've learned fine-tuning by understanding every piece: the data format, the LoRA math, the training pipeline, the evaluation framework, the deployment architecture, and the cost math. Now create something with it.