Architecture Overview

Transformer

The architecture behind GPT, Claude, Gemini, and everything else.

Year 2017
Creator Vaswani et al.
Category Sequence Model
Complexity O(n²d)
The Idea

What Is It?

A Transformer is a neural network that processes sequences by letting every element attend to every other element in parallel. Instead of reading tokens one-by-one like an RNN, it computes all pairwise relationships simultaneously using a mechanism called self-attention.

This simple idea — "Attention Is All You Need" — replaced recurrence with parallelism, unlocking massive scaling on GPUs. Every modern LLM, from GPT-4 to Claude to Gemini, is built on this foundation. The core architecture has barely changed since 2017; what changed is the scale and the training recipe.

Architecture

The Full Transformer Block

The decoder-only variant (used by GPT, Claude, and Llama) stacks N identical blocks, each with self-attention and a feed-forward network.

Data Flow

Input / Output

Input

A sequence of token IDs (integers) mapped to dense vectors via an embedding table. Shape: [B, T, d_model] where T is sequence length and d_model is typically 4096–12288.

Output

A probability distribution over the vocabulary for the next token at each position. During autoregressive generation, only the last position's logits matter. Shape: [B, T, V] where V is vocab size (~32k–256k).
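
These shapes can be traced end-to-end with a toy NumPy sketch (sizes here are illustrative and far smaller than the production figures above):

```python
import numpy as np

# Toy sizes; real models use d_model ~4096+ and V ~32k-256k.
B, T, d_model, V = 2, 5, 16, 100
token_ids = np.random.randint(0, V, size=(B, T))   # input: integer token IDs
embedding = np.random.randn(V, d_model)            # embedding table
x = embedding[token_ids]                           # [B, T, d_model]
W_out = np.random.randn(d_model, V)                # output (unembedding) projection
logits = x @ W_out                                 # [B, T, V]
# Softmax over the vocab axis gives a next-token distribution per position.
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
next_token = probs[:, -1].argmax(-1)               # generation uses only the last position
```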

Core Mechanisms

The Four Pillars

Self-Attention

Each token computes Query, Key, and Value vectors. Attention weights = softmax(QKᵀ / √d_k). The output is a weighted sum of Values. This lets every token "look at" every other token.

Multi-Head Attention

Instead of one big attention, split into H parallel heads (e.g., 32–128), each attending to a different subspace. Concat and project. This lets the model capture different relationship types simultaneously.

Causal Masking

In decoder-only models, each token can only attend to itself and earlier tokens. This is enforced by setting future positions to −∞ before softmax — a triangular mask that prevents information leaking from the future.

Cross-Attention

Used in encoder-decoder models (T5, original Transformer). Queries come from the decoder; Keys and Values come from the encoder output. This is how the decoder "reads" the input sequence during translation or summarization.
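
The first and third pillars combine into a short NumPy sketch of single-head causal self-attention (weight matrices and toy sizes are illustrative):

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    T = x.shape[0]
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (T, T) pairwise similarities
    mask = np.triu(np.ones((T, T), dtype=bool), k=1) # strictly-upper triangle = future
    scores[mask] = -np.inf                           # causal mask before softmax
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                    # rows sum to 1
    return w @ V, w                                  # weighted sum of Values

T, d_model, d_k = 4, 8, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(T, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, w = causal_self_attention(x, Wq, Wk, Wv)
```

Multi-head attention runs H copies of this on sliced subspaces of Q, K, V, then concatenates and projects the outputs.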

Position

Positional Encodings

Transformers have no built-in notion of order. Positional encodings inject sequence position information. Three dominant approaches:

Sinusoidal

Fixed sine/cosine waves at different frequencies, as in the original paper. No learned parameters. In principle generalizes to any length.

RoPE

Rotary Position Embedding. Rotates Q and K vectors by position-dependent angles. Used by Llama, Mistral, Qwen. Relative by design.

ALiBi

Attention with Linear Biases. No positional embedding — just subtracts a linear penalty proportional to distance. Simple and effective.
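
The sinusoidal scheme is small enough to write out directly; a minimal NumPy version of the original paper's formula (even d_model assumed):

```python
import numpy as np

def sinusoidal_pe(T, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(T)[:, None]               # (T, 1) positions
    i = np.arange(d_model // 2)[None, :]      # (1, d/2) frequency indices
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dims: sine
    pe[:, 1::2] = np.cos(angles)              # odd dims: cosine
    return pe

pe = sinusoidal_pe(T=6, d_model=8)            # added to token embeddings
```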

Evolution

Key Innovations

The Transformer has been iteratively improved since 2017. The biggest wins:

2019

KV Caching

Cache Key/Value matrices for past tokens during autoregressive decode. Avoids recomputation. Now universal.

2021

RoPE

Rotary Position Embeddings encode relative position via rotation matrices. Dominant in open-source LLMs.

2022

Flash Attention

Fuses attention into a single GPU kernel with tiling. 2–4x faster, with memory growing linearly rather than quadratically in sequence length. Changed everything.

2022

Mixture of Experts

Replace the dense FFN with sparsely routed experts. Mixtral uses 8 experts, picks top-2. More capacity, same FLOPs.

2023

GQA / MQA

Grouped-Query / Multi-Query Attention. Share K/V heads across query heads. Cuts the KV cache by 4–8x. Used in Llama 2+.

2023

Speculative Decoding

Draft tokens with a small model, verify in parallel with the big model. 2–3x faster inference without quality loss.

2023

Sliding Window

Limit attention to a local window plus global tokens. Mistral's approach to long context without full quadratic cost.

2023

Ring Attention

Distributes attention across devices in a ring. Enables million-token contexts by splitting the KV cache across GPUs.
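
One of these wins, grouped-query attention, reduces to a single broadcast step; a NumPy sketch with toy, hypothetical sizes (8 query heads sharing 2 K/V heads):

```python
import numpy as np

H_q, H_kv, T, d = 8, 2, 5, 4               # toy sizes
rng = np.random.default_rng(0)
q = rng.normal(size=(H_q, T, d))
k = rng.normal(size=(H_kv, T, d))          # only H_kv K/V heads are cached
v = rng.normal(size=(H_kv, T, d))
group = H_q // H_kv                        # query heads per shared K/V head
k_full = np.repeat(k, group, axis=0)       # broadcast shared K to every query head
v_full = np.repeat(v, group, axis=0)
scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)              # per-head softmax over positions
out = w @ v_full                           # (H_q, T, d), same shape as full MHA
# The KV cache stores k and v, not k_full/v_full: H_q / H_kv = 4x smaller here.
```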

Training

Training Pipeline

Stage 1: Pretraining

Massive next-token prediction on internet-scale data. This is where the model learns language, facts, and reasoning patterns.

Data ~15T tokens
Tokenize BPE / SentencePiece
Forward N layers
Loss Cross-entropy
Backprop Gradients
Optimize AdamW
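
The loss step above is cross-entropy with a shift-by-one: position t predicts token t+1. A minimal NumPy sketch (toy sizes; a batch of already-tokenized IDs is assumed):

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Mean cross-entropy for next-token prediction."""
    pred = logits[:, :-1]                         # positions 0..T-2 predict targets 1..T-1
    targets = token_ids[:, 1:]
    m = pred.max(-1, keepdims=True)               # numerically stable log-softmax
    logp = pred - (m + np.log(np.exp(pred - m).sum(-1, keepdims=True)))
    nll = -np.take_along_axis(logp, targets[..., None], axis=-1)
    return nll.mean()

B, T, V = 2, 6, 50
token_ids = np.random.randint(0, V, size=(B, T))
# All-zero logits give a uniform distribution, so the loss equals ln(V).
uniform_loss = next_token_loss(np.zeros((B, T, V)), token_ids)
```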

Stage 2: Alignment

Fine-tune the model to follow instructions, be helpful, and avoid harm.

SFT Instruction tuning
RLHF / DPO Human preference
Safety Red-teaming

Efficient Fine-tuning

Fully fine-tuning a 70B model requires hundreds of GB of memory. LoRA freezes the base model and trains low-rank adapter matrices (rank 8–64), cutting trainable parameters by over 99%. QLoRA goes further, quantizing the base model to 4-bit so fine-tuning fits on a single GPU.
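
The LoRA forward pass is a one-liner; a NumPy sketch of the idea with toy, hypothetical sizes (forward pass only, no training loop):

```python
import numpy as np

d, r, alpha = 64, 8, 16                    # toy sizes; real adapters use d ~ 4096+
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))                # frozen base weight (never updated)
A = rng.normal(size=(d, r)) * 0.01         # trainable down-projection
B = np.zeros((r, d))                       # trainable up-projection, zero-initialized

def lora_forward(x):
    # Frozen path plus scaled low-rank update; (alpha / r) is the LoRA scaling.
    return x @ W + (alpha / r) * (x @ A @ B)

x = rng.normal(size=(3, d))
# Because B starts at zero, the adapter is a no-op before training.
# Trainable params: 2*d*r = 1024 here, vs d*d = 4096 for the frozen weight.
```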

Inference

Inference Pipeline

Generation happens in two phases: prefill (process the full prompt in parallel) and decode (generate tokens one at a time using the KV cache).

KV Cache avoids recomputing past tokens
Prompt User input
Prefill Parallel encode
KV Cache Store K, V
Decode Autoregressive
Sample Top-p / Temp
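
The final sampling stage can be sketched as temperature scaling plus top-p (nucleus) filtering; this is a minimal NumPy version, not any particular library's implementation:

```python
import numpy as np

def sample_top_p(logits, p=0.9, temperature=0.8, rng=None):
    """Sample a token ID from final-position logits with temperature + top-p."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature                 # sharpen (<1) or flatten (>1)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]               # most likely first
    csum = np.cumsum(probs[order])
    cutoff = np.searchsorted(csum, p) + 1         # smallest nucleus with mass >= p
    keep = order[:cutoff]
    nucleus = probs[keep] / probs[keep].sum()     # renormalize within the nucleus
    return rng.choice(keep, p=nucleus)

token = sample_top_p(np.array([2.0, 1.0, 0.1, -1.0]), p=0.9)
```

As p → 0 the nucleus shrinks to a single token and sampling becomes greedy decoding.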
Landscape

Model Zoo

Notable Transformer-based models and their key specs.

Model | Params | Context | Attention | Key Features
GPT-4o | ~1.8T (MoE) | 128k | GQA | Multimodal, MoE, tool use
Claude 3.5 Sonnet | undisclosed | 200k | — | Long context, strong reasoning, tool use
Gemini 1.5 Pro | ~1T (MoE) | 1M+ | Ring Attention | Extreme context window, multimodal native
Llama 3.1 405B | 405B | 128k | GQA + RoPE | Open-weight, dense, multilingual
Mistral Large | 123B | 128k | GQA + SWA | Sliding window attention, efficient
Mixtral 8x22B | 176B (39B active) | 64k | GQA + MoE | Sparse MoE, 8 experts, top-2 routing
DeepSeek-V3 | 671B (37B active) | 128k | MLA + MoE | Multi-head Latent Attention, 256 experts
Qwen 2.5 72B | 72B | 128k | GQA + RoPE | Strong multilingual, code, math