8-Part Technical Series

LLM Internals

A deep technical series demystifying how large language models actually work — from the mechanics of tokenization to the algorithms behind alignment. Built for engineers who want to understand these systems, not just use them.

8 Articles
~280 min total read
30+ Interactive demos
01
Foundations

Tokenization & Embeddings

How raw text transforms into the numbers a model can process — BPE, WordPiece, SentencePiece, dense embeddings, and positional encodings, including RoPE.

02
Architecture

Attention & Transformer Blocks

The mechanics of scaled dot-product attention, multi-head attention, feed-forward sublayers, layer normalization, and residual connections.

03
Training

Training & Fine-tuning

Causal language modeling objectives, the transformer training loop, learning rate schedules, gradient checkpointing, and supervised fine-tuning.

04
Efficiency

LoRA & QLoRA

Parameter-efficient fine-tuning via low-rank matrix decomposition, the math behind LoRA, QLoRA's 4-bit quantized base model, and practical training strategies.

05
Alignment

DPO & Alignment

RLHF from first principles, reward modeling, PPO for language models, and how Direct Preference Optimization sidesteps training an explicit reward model entirely.

06
Efficiency

Quantization

Floating-point formats, INT8 and INT4 quantization, post-training quantization methods (GPTQ, AWQ), GGUF, and the quality/performance tradeoffs.

07
Inference

KV Cache & Inference Systems

The key-value cache in depth, continuous batching, speculative decoding, tensor parallelism, and what production LLM serving infrastructure looks like.

08
Systems

FlashAttention & PagedAttention

GPU memory hierarchy, IO-aware attention algorithms, FlashAttention v1/v2/v3, and vLLM's PagedAttention for memory-efficient KV cache management.