The architecture behind GPT, Claude, Gemini, and everything else.
A Transformer is a neural network that processes sequences by letting every element attend to every other element in parallel. Instead of reading tokens one-by-one like an RNN, it computes all pairwise relationships simultaneously using a mechanism called self-attention.
This simple idea — "Attention Is All You Need" — replaced recurrence with parallelism, unlocking massive scaling on GPUs. Every modern LLM, from GPT-4 to Claude to Gemini, is built on this foundation. The core architecture has barely changed since 2017; what changed is the scale and the training recipe.
Click any component to highlight its data flow. The decoder-only variant (used by GPT, Claude, Llama) stacks N identical blocks, each with self-attention and a feed-forward network.
A sequence of token IDs (integers) mapped to dense vectors via an embedding table.
Shape: [B, T, d_model], where B is batch size, T is sequence length, and d_model is the hidden width (typically 4096–12288).
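The lookup is just indexing into a matrix. A minimal numpy sketch with toy sizes (real models use far larger V and d_model):

```python
import numpy as np

# Toy sizes, assumed for illustration; real models use V ~ 32k-256k, d_model ~ 4096+.
V, d_model = 1000, 64          # vocab size, hidden width
B, T = 2, 8                    # batch size, sequence length

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(V, d_model))  # one learned row per token ID

token_ids = rng.integers(0, V, size=(B, T))      # integer token IDs, [B, T]
x = embedding_table[token_ids]                   # lookup: [B, T] -> [B, T, d_model]

print(x.shape)  # (2, 8, 64)
```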
A probability distribution over the vocabulary for the next token at each position.
During autoregressive generation, only the last position's logits matter.
Shape: [B, T, V] where V is vocab size (~32k–256k).
Each token computes Query, Key, and Value vectors. Attention weights = softmax(QKᵀ / √d_k). The output is a weighted sum of Values. This lets every token "look at" every other token.
Instead of one big attention, split into H parallel heads (e.g., 32–128), each attending to a different subspace. Concat and project. This lets the model capture different relationship types simultaneously.
In decoder-only models, each token can only attend to itself and earlier tokens. This is enforced by setting future positions to −∞ before softmax — a triangular mask that prevents information leaking from the future.
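The three ideas above — scaled dot-product attention, multiple heads, and the causal mask — fit in one short function. A numpy sketch under assumed toy shapes (no output projection, single sequence, no batch dimension):

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv, n_heads):
    """Multi-head causal self-attention (toy sketch, output projection omitted)."""
    T, d = x.shape
    hd = d // n_heads                          # per-head dimension d_k
    Q, K, V = x @ Wq, x @ Wk, x @ Wv           # each [T, d]
    # Split into heads: [n_heads, T, hd]
    split = lambda M: M.reshape(T, n_heads, hd).transpose(1, 0, 2)
    Q, K, V = split(Q), split(K), split(V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(hd)    # [n_heads, T, T]
    mask = np.triu(np.ones((T, T)), k=1).astype(bool)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)           # block future positions
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                   # softmax over keys
    out = w @ V                                        # weighted sum of Values
    return out.transpose(1, 0, 2).reshape(T, d)        # concat heads -> [T, d]

rng = np.random.default_rng(0)
T, d = 5, 16
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
y = causal_self_attention(x, Wq, Wk, Wv, n_heads=4)
print(y.shape)  # (5, 16)
```

Setting masked scores to −∞ before the softmax makes their weights exactly zero, so token t's output depends only on tokens 0…t.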
Used in encoder-decoder models (T5, original Transformer). Queries come from the decoder; Keys and Values come from the encoder output. This is how the decoder "reads" the input sequence during translation or summarization.
Watch how Query, Key, and Value matrices interact. Hover over tokens to see which other tokens they attend to and with what weight.
Transformers have no built-in notion of order. Positional encodings inject sequence position information. Three dominant approaches:
Fixed sine/cosine waves at different frequencies. Original paper. No learned parameters. Theoretically generalizes to any length.
Rotary Position Embedding. Rotates Q and K vectors by position-dependent angles. Used by Llama, Mistral, Qwen. Relative by design.
Attention with Linear Biases. No positional embedding — just subtracts a linear penalty proportional to distance. Simple and effective.
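The sinusoidal variant is the easiest of the three to write down. A sketch of the original paper's formulation, with sin on even channels and cos on odd channels:

```python
import numpy as np

def sinusoidal_pe(T, d_model):
    """Fixed sine/cosine positional encodings from the original Transformer paper."""
    pos = np.arange(T)[:, None]               # positions 0..T-1, column vector
    i = np.arange(0, d_model, 2)[None, :]     # even channel indices
    angles = pos / (10000 ** (i / d_model))   # one frequency per channel pair
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(T=50, d_model=64)
print(pe.shape)     # (50, 64)
print(pe[0, :4])    # position 0: sin channels are 0, cos channels are 1 -> [0. 1. 0. 1.]
```

Because the frequencies are fixed rather than learned, the same function can emit encodings for any position, which is why it generalizes to unseen lengths in principle.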
The Transformer has been iteratively improved since 2017. The biggest wins:
Cache Key/Value matrices for past tokens during autoregressive decode. Avoids recomputation. Now universal.
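The mechanics are just array concatenation: each decode step computes K/V for the one new token and appends them. A minimal sketch with assumed toy shapes:

```python
import numpy as np

# Toy KV cache: during decode, append each new token's K/V instead of
# recomputing the whole sequence. Shapes assumed: [T, d_k].
d_k = 8
k_cache = np.empty((0, d_k))
v_cache = np.empty((0, d_k))

rng = np.random.default_rng(0)
for step in range(4):                        # four decode steps
    k_new = rng.normal(size=(1, d_k))        # K/V for the one new token only
    v_new = rng.normal(size=(1, d_k))
    k_cache = np.concatenate([k_cache, k_new])
    v_cache = np.concatenate([v_cache, v_new])
    q = rng.normal(size=(1, d_k))            # query for the new token
    scores = q @ k_cache.T / np.sqrt(d_k)    # attends to all cached positions
    # ... softmax(scores) @ v_cache as usual

print(k_cache.shape)  # (4, 8)
```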
Fuses attention into a single GPU kernel with tiling, never materializing the full T×T score matrix. 2–4x faster, with extra memory linear in sequence length instead of quadratic. Changed everything.
Distributes attention across devices in a ring. Enables million-token contexts by splitting the KV cache across GPUs.
Replace dense FFN with sparse routed experts. Mixtral uses 8 experts, picks top-2. More capacity, same FLOPs.
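The routing step can be sketched in a few lines. This is a toy illustration of top-k gating, not Mixtral's actual implementation; each "expert" here is a single linear map standing in for a full FFN:

```python
import numpy as np

def moe_ffn(x, experts, router_W, top_k=2):
    """Sketch of sparse top-k expert routing for one token vector x."""
    logits = x @ router_W                     # one router score per expert
    top = np.argsort(logits)[-top_k:]         # indices of the top-k experts
    gate = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalized gates
    # Only the chosen experts run; the rest cost no FLOPs.
    return sum(g * experts[e](x) for g, e in zip(gate, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: x @ W for W in weights]   # stand-ins for full FFNs
router_W = rng.normal(size=(d, n_experts))

y = moe_ffn(rng.normal(size=d), experts, router_W)
print(y.shape)  # (16,)
```

The model stores all 8 experts' parameters (capacity) but each token only pays for 2 of them (FLOPs).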
Draft tokens with a small model, verify in parallel with the big model. 2–3x faster inference without quality loss.
Grouped-Query / Multi-Query Attention. Share K/V heads across query heads. Cuts KV cache by 4–8x. Used in Llama 2+.
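The cache saving falls straight out of the shapes: only the K/V heads are stored, then broadcast across their query-head group at attention time. A toy sketch with assumed sizes:

```python
import numpy as np

# GQA sketch: 8 query heads share 2 K/V heads, shrinking the KV cache 4x.
n_q_heads, n_kv_heads, T, hd = 8, 2, 5, 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_q_heads, T, hd))
K = rng.normal(size=(n_kv_heads, T, hd))   # only 2 K/V heads are ever cached
V = rng.normal(size=(n_kv_heads, T, hd))

group = n_q_heads // n_kv_heads            # query heads per K/V head
K_rep = np.repeat(K, group, axis=0)        # broadcast K/V across each group
V_rep = np.repeat(V, group, axis=0)
scores = Q @ K_rep.transpose(0, 2, 1) / np.sqrt(hd)
print(scores.shape)  # (8, 5, 5)
```

Multi-Query Attention is the extreme case n_kv_heads = 1.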
Rotary Position Embeddings encode relative position via rotation matrices. Dominant in open-source LLMs.
Limit attention to a local window of recent tokens (some variants, like Longformer, add a few global tokens). Mistral's approach to efficient long context without the full quadratic cost.
Massive next-token prediction on internet-scale data. This is where the model learns language, facts, and reasoning patterns.
Fine-tune the model to follow instructions, be helpful, and avoid harm.
Full fine-tuning a 70B model requires hundreds of GB of memory. LoRA freezes the base model and trains low-rank adapter matrices (rank 8–64), reducing trainable parameters by 99%+. QLoRA goes further by quantizing the base model to 4-bit, making fine-tuning possible on a single GPU.
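The core of LoRA is one extra low-rank term in the forward pass. A numpy sketch following the paper's convention (frozen W, trainable A and B, scaling α/r; sizes here are toy assumptions):

```python
import numpy as np

# LoRA sketch: frozen weight W plus trainable low-rank update (x @ A) @ B.
d_in, d_out, r, alpha = 64, 64, 8, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d_in, d_out))       # frozen base weight
A = rng.normal(size=(d_in, r)) * 0.01    # trainable, rank r
B = np.zeros((r, d_out))                 # trainable, zero-init: no change at start

def lora_forward(x):
    return x @ W + (alpha / r) * (x @ A @ B)

x = rng.normal(size=(2, d_in))
# With B = 0 the adapted layer matches the frozen layer exactly:
print(np.allclose(lora_forward(x), x @ W))  # True

print(W.size, A.size + B.size)  # 4096 frozen vs 1024 trainable at this toy size
```

At real sizes the ratio is far more dramatic: rank-8 adapters on a 4096×4096 weight train ~65k parameters against ~16.8M frozen ones, which is where the 99%+ reduction comes from.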
Generation happens in two phases: prefill (process the full prompt in parallel) and decode (generate tokens one at a time using the KV cache).
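A toy greedy-decoding loop makes the two phases concrete. The "model" here is a stand-in returning random logits; a real implementation would fill the KV cache during prefill and feed only the new token during decode:

```python
import numpy as np

V = 100  # toy vocab size, assumed for illustration
rng = np.random.default_rng(0)

def model(token_ids):
    """Stand-in for a Transformer: logits for every position, [T, V]."""
    return rng.normal(size=(len(token_ids), V))

prompt = [5, 17, 42]
# Prefill: one parallel pass over the whole prompt.
logits = model(prompt)
next_token = int(np.argmax(logits[-1]))      # only the last position's logits matter

# Decode: one token at a time.
generated = list(prompt)
for _ in range(5):
    generated.append(next_token)
    logits = model(generated)                # real impls feed only the new token
    next_token = int(np.argmax(logits[-1]))

print(len(generated))  # 8: 3 prompt tokens + 5 generated
```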
Notable Transformer-based models and their key specs.
| Model | Params | Context | Attention | Key Features |
|---|---|---|---|---|
| GPT-4o | ~1.8T (MoE) | 128k | GQA | Multimodal, MoE, tool use |
| Claude 3.5 Sonnet | undisclosed | 200k | — | Long context, strong reasoning, tool use |
| Gemini 1.5 Pro | ~1T (MoE) | 1M+ | Ring Attention | Extreme context window, multimodal native |
| Llama 3.1 405B | 405B | 128k | GQA + RoPE | Open-weight, dense, multilingual |
| Mistral Large | 123B | 128k | GQA + SWA | Sliding window attention, efficient |
| Mixtral 8x22B | 176B (39B active) | 64k | GQA + MoE | Sparse MoE, 8 experts top-2 routing |
| DeepSeek-V3 | 671B (37B active) | 128k | MLA + MoE | Multi-head Latent Attention, 256 experts |
| Qwen 2.5 72B | 72B | 128k | GQA + RoPE | Strong multilingual, code, math |