Tokenization & Embeddings
How raw text transforms into the numbers a model can process — BPE, WordPiece, SentencePiece, dense embeddings, and positional encoding including RoPE.
A deep technical series demystifying how large language models actually work, from the mechanics of tokenization to the algorithms behind alignment. Built for engineers who want to understand these systems, not just use them.
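As a taste of the byte-pair encoding covered in the tokenization installment: BPE training is a loop that repeatedly merges the most frequent adjacent symbol pair. A minimal sketch, with an illustrative toy corpus; the naive string replace skips boundary edge cases a real tokenizer handles:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of the pair with its concatenation."""
    merged, joined = " ".join(pair), "".join(pair)
    return {word.replace(merged, joined): freq for word, freq in words.items()}

def learn_bpe(words, num_merges):
    """Learn a merge list from a {space-separated-symbols: frequency} vocabulary."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(best, words)
        merges.append(best)
    return merges, words

# Toy corpus: the learned merges build up subwords like "es", "est", "low".
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges, vocab = learn_bpe(corpus, 4)
```

At inference time, tokenization replays this merge list in order over a new word's characters.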
The mechanics of scaled dot-product attention, multi-head attention, feed-forward sublayers, layer normalization, and residual connections.
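The scaled dot-product attention at the heart of that installment fits in a few lines. A dependency-free sketch over plain lists (shapes and names are illustrative, not the series' own code):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Q, K, V are lists of vectors (seq_len x d); returns one output
    vector per query, a convex combination of the value vectors.
    """
    d = len(K[0])
    out = []
    for q in Q:
        # similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # weighted sum of value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Multi-head attention runs this in parallel over several learned projections of Q, K, and V, then concatenates the results.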
Causal language modeling objectives, the transformer training loop, learning rate schedules, gradient checkpointing, and supervised fine-tuning.
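One of the schedules that installment covers, linear warmup followed by cosine decay, can be sketched as a pure function of the step (function name and defaults are illustrative):

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr over warmup_steps, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # fraction of the decay phase completed, clamped to [0, 1]
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The training loop queries this once per optimizer step and sets the learning rate before calling the update.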
Parameter-efficient fine-tuning via low-rank matrix decomposition, the math behind LoRA, QLoRA's 4-bit quantized base model, and practical training strategies.
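The core LoRA idea is small enough to sketch: keep the pretrained weight W frozen and learn a low-rank update (alpha/r) * B A, with B initialized to zero so training starts from the unmodified model. A minimal illustration over plain lists (class and parameter names are assumptions, not a real library's API):

```python
import random

def matmul(A, B):
    """Plain dense matrix multiply for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha/r) * B @ A.

    A is (r x d_in) with small random init; B is (d_out x r) initialized
    to zero, so the adapted layer starts out identical to the frozen one.
    """
    def __init__(self, W, r, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                      # frozen pretrained weight
        self.scale = alpha / r
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]  # zero init

    def effective_weight(self):
        """W + scale * B @ A — the matrix you would merge back for inference."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(wr, dr)]
                for wr, dr in zip(self.W, delta)]
```

Only A and B receive gradients, so the trainable parameter count is r * (d_in + d_out) per adapted matrix instead of d_in * d_out.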
RLHF from first principles, reward modeling, PPO for language models, and how Direct Preference Optimization sidesteps the reward model entirely.
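The DPO objective that installment builds up to is just a logistic loss on log-probability ratios, with no reward model in the loop. A per-pair sketch (argument names are illustrative; inputs are summed token log-probs under the trainable policy and the frozen reference):

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin), where the margin is the
    policy-vs-reference log-ratio of the preferred completion minus that of
    the rejected one."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)); log1p keeps this stable for m >= 0
    return math.log1p(math.exp(-margin))
```

The loss is log 2 when the policy matches the reference, and shrinks as the policy shifts probability mass toward the preferred completion.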
Floating-point formats, INT8 and INT4 quantization, post-training quantization methods (GPTQ, AWQ), GGUF, and the quality/performance tradeoffs.
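The simplest scheme that installment starts from, symmetric per-tensor INT8, maps each float to an integer in [-127, 127] via a single scale. A sketch (per-channel scales and the calibration used by GPTQ/AWQ are where the real methods differ):

```python
def quantize_int8(xs):
    """Symmetric per-tensor INT8 quantization: x ~= scale * q, q in [-127, 127]."""
    scale = max(abs(x) for x in xs) / 127 or 1.0  # avoid zero scale for all-zero input
    q = [max(-127, min(127, round(x / scale))) for x in xs]  # round and clamp
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [scale * v for v in q]
```

The round-trip error is bounded by half the scale, which is why outlier values (which inflate the scale) are the central problem the fancier post-training methods attack.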
The key-value cache in depth, continuous batching, speculative decoding, tensor parallelism, and what production LLM serving infrastructure looks like.
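The KV cache's contract is small: append each new token's key and value once, then attend the new query against the whole cached history instead of recomputing the prefix. A per-layer sketch over plain lists (class and method names are illustrative):

```python
import math

class KVCache:
    """Minimal per-layer KV cache for autoregressive decoding."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        """Cache this step's key/value, then attend q over the full history."""
        self.keys.append(k)
        self.values.append(v)
        scale = math.sqrt(len(q))
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / scale
                  for key in self.keys]
        # softmax over all cached positions
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        w = [x / total for x in w]
        # weighted sum of cached value vectors
        return [sum(wi * val[j] for wi, val in zip(w, self.values))
                for j in range(len(v))]
```

Per-step cost becomes linear in the sequence length generated so far; the price is the cache's memory footprint, which is what continuous batching and paging schemes manage.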
GPU memory hierarchy, IO-aware attention algorithms, FlashAttention v1/v2/v3, and vLLM's PagedAttention for memory-efficient KV cache management.
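The trick that makes IO-aware attention possible is the online softmax: a single pass that keeps a running max, normalizer, and weighted sum, rescaling the partials whenever a larger score appears. A scalar-valued sketch of that rescaling (FlashAttention applies it tile by tile to vectors in SRAM):

```python
import math

def online_softmax_weighted_sum(scores_and_values):
    """One-pass softmax-weighted sum over (score, value) pairs, never
    materializing the full score row — the core FlashAttention recurrence."""
    m = float("-inf")  # running max score
    denom = 0.0        # running sum of exp(score - m)
    acc = 0.0          # running exp-weighted sum of values
    for s, v in scores_and_values:
        m_new = max(m, s)
        # rescale old partials to the new max (exp(-inf) == 0.0 on the first step)
        corr = math.exp(m - m_new)
        denom = denom * corr + math.exp(s - m_new)
        acc = acc * corr + math.exp(s - m_new) * v
        m = m_new
    return acc / denom
```

Because each tile of keys/values only needs the three running statistics, attention never has to write the O(n^2) score matrix to GPU main memory.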