Tokenization & Embeddings
How raw text transforms into the numbers a model can process — BPE, WordPiece, SentencePiece, dense embeddings, and positional encoding including RoPE.
A deep technical series demystifying how large language models actually work, from the mechanics of tokenization to the algorithms behind alignment. Built for engineers who want to understand these systems, not just use them.
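As a taste of the byte-pair encoding covered in the tokenization installment: BPE training is a loop that repeatedly merges the most frequent adjacent symbol pair. A minimal sketch, with an illustrative toy corpus; the naive string replace skips boundary edge cases a real tokenizer handles:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of the pair with its concatenation."""
    merged, joined = " ".join(pair), "".join(pair)
    return {word.replace(merged, joined): freq for word, freq in words.items()}

def learn_bpe(words, num_merges):
    """Learn a merge list from a {space-separated-symbols: frequency} vocabulary."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(best, words)
        merges.append(best)
    return merges, words

# Toy corpus: the learned merges build up subwords like "es", "est", "low".
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges, vocab = learn_bpe(corpus, 4)
```

At inference time, tokenization replays this merge list in order over a new word's characters.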
The mechanics of scaled dot-product attention, multi-head attention, feed-forward sublayers, layer normalization, and residual connections.
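The scaled dot-product attention at the heart of that installment fits in a few lines. A dependency-free sketch over plain lists (shapes and names are illustrative, not the series' own code):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Q, K, V are lists of vectors (seq_len x d); returns one output
    vector per query, a convex combination of the value vectors.
    """
    d = len(K[0])
    out = []
    for q in Q:
        # similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # weighted sum of value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Multi-head attention runs this in parallel over several learned projections of Q, K, and V, then concatenates the results.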
Causal language modeling objectives, the transformer training loop, learning rate schedules, gradient checkpointing, and supervised fine-tuning.
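One of the schedules that installment covers, linear warmup followed by cosine decay, can be sketched as a pure function of the step (function name and defaults are illustrative):

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr over warmup_steps, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # fraction of the decay phase completed, clamped to [0, 1]
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The training loop queries this once per optimizer step and sets the learning rate before calling the update.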
Parameter-efficient fine-tuning via low-rank matrix decomposition, the math behind LoRA, QLoRA's 4-bit quantized base model, and practical training strategies.
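The core LoRA idea is small enough to sketch: keep the pretrained weight W frozen and learn a low-rank update (alpha/r) * B A, with B initialized to zero so training starts from the unmodified model. A minimal illustration over plain lists (class and parameter names are assumptions, not a real library's API):

```python
import random

def matmul(A, B):
    """Plain dense matrix multiply for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha/r) * B @ A.

    A is (r x d_in) with small random init; B is (d_out x r) initialized
    to zero, so the adapted layer starts out identical to the frozen one.
    """
    def __init__(self, W, r, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                      # frozen pretrained weight
        self.scale = alpha / r
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]  # zero init

    def effective_weight(self):
        """W + scale * B @ A — the matrix you would merge back for inference."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(wr, dr)]
                for wr, dr in zip(self.W, delta)]
```

Only A and B receive gradients, so the trainable parameter count is r * (d_in + d_out) per adapted matrix instead of d_in * d_out.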
RLHF from first principles, reward modeling, PPO for language models, and how Direct Preference Optimization sidesteps the reward model entirely.
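The DPO objective that installment builds up to is just a logistic loss on log-probability ratios, with no reward model in the loop. A per-pair sketch (argument names are illustrative; inputs are summed token log-probs under the trainable policy and the frozen reference):

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin), where the margin is the
    policy-vs-reference log-ratio of the preferred completion minus that of
    the rejected one."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)); log1p keeps this stable for m >= 0
    return math.log1p(math.exp(-margin))
```

The loss is log 2 when the policy matches the reference, and shrinks as the policy shifts probability mass toward the preferred completion.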
Floating-point formats, INT8 and INT4 quantization, post-training quantization methods (GPTQ, AWQ), GGUF, and the quality/performance tradeoffs.
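The simplest scheme that installment starts from, symmetric per-tensor INT8, maps each float to an integer in [-127, 127] via a single scale. A sketch (per-channel scales and the calibration used by GPTQ/AWQ are where the real methods differ):

```python
def quantize_int8(xs):
    """Symmetric per-tensor INT8 quantization: x ~= scale * q, q in [-127, 127]."""
    scale = max(abs(x) for x in xs) / 127 or 1.0  # avoid zero scale for all-zero input
    q = [max(-127, min(127, round(x / scale))) for x in xs]  # round and clamp
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [scale * v for v in q]
```

The round-trip error is bounded by half the scale, which is why outlier values (which inflate the scale) are the central problem the fancier post-training methods attack.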
The key-value cache in depth, continuous batching, speculative decoding, tensor parallelism, and what production LLM serving infrastructure looks like.
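The KV cache's contract is small: append each new token's key and value once, then attend the new query against the whole cached history instead of recomputing the prefix. A per-layer sketch over plain lists (class and method names are illustrative):

```python
import math

class KVCache:
    """Minimal per-layer KV cache for autoregressive decoding."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        """Cache this step's key/value, then attend q over the full history."""
        self.keys.append(k)
        self.values.append(v)
        scale = math.sqrt(len(q))
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / scale
                  for key in self.keys]
        # softmax over all cached positions
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        w = [x / total for x in w]
        # weighted sum of cached value vectors
        return [sum(wi * val[j] for wi, val in zip(w, self.values))
                for j in range(len(v))]
```

Per-step cost becomes linear in the sequence length generated so far; the price is the cache's memory footprint, which is what continuous batching and paging schemes manage.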
GPU memory hierarchy, IO-aware attention algorithms, FlashAttention v1/v2/v3, and vLLM's PagedAttention for memory-efficient KV cache management.
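The trick that makes IO-aware attention possible is the online softmax: a single pass that keeps a running max, normalizer, and weighted sum, rescaling the partials whenever a larger score appears. A scalar-valued sketch of that rescaling (FlashAttention applies it tile by tile to vectors in SRAM):

```python
import math

def online_softmax_weighted_sum(scores_and_values):
    """One-pass softmax-weighted sum over (score, value) pairs, never
    materializing the full score row — the core FlashAttention recurrence."""
    m = float("-inf")  # running max score
    denom = 0.0        # running sum of exp(score - m)
    acc = 0.0          # running exp-weighted sum of values
    for s, v in scores_and_values:
        m_new = max(m, s)
        # rescale old partials to the new max (exp(-inf) == 0.0 on the first step)
        corr = math.exp(m - m_new)
        denom = denom * corr + math.exp(s - m_new)
        acc = acc * corr + math.exp(s - m_new) * v
        m = m_new
    return acc / denom
```

Because each tile of keys/values only needs the three running statistics, attention never has to write the O(n^2) score matrix to GPU main memory.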