Activation Functions — ReLU to SwiGLU

Chapter 0: The Dead Network Problem

Stack ten linear layers. The result? Mathematically identical to a single linear layer. No matter how deep you go, you can only model straight lines, flat planes, linear boundaries. A 100-layer linear network has the same representational power as a 1-layer one. That's why activation functions exist — they inject the nonlinearity that lets deep networks model curves, corners, and complexity.

This isn't obvious until you see it fail. Take a classification problem where the classes aren't separable by a straight line — say, two interlocking spirals or concentric circles. A linear model, no matter how many layers, draws a single straight boundary. It fails catastrophically. Add one nonlinear activation function between each layer, and suddenly the network can bend, twist, and carve out curved decision regions.

But "just add nonlinearity" hides a trap. The wrong nonlinearity kills your network in a different way — by crushing gradients to zero so weights stop updating. This chapter shows both failure modes: the futility of depth without nonlinearity, and the danger of choosing poorly.

Why Linear Layers Collapse

A single linear layer computes y = Wx + b. Stack two:

y₁ = W₁ · x + b₁

y₂ = W₂ · y₁ + b₂

Substitute y₁ into the second equation:

y₂ = W₂ · (W₁ · x + b₁) + b₂ = (W₂ · W₁) · x + (W₂ · b₁ + b₂)

That's just W_eff · x + b_eff — a single linear transformation. Two layers collapsed into one. Three layers? Same thing. A hundred layers? Still one effective matrix times one effective bias.

Hand Calculation: The Collapse in Action

Setup. Two 2×2 weight matrices, no bias, no activation. We'll multiply them together and show that two layers = one layer.

Layer 1 weights: W₁ = [[1, 2], [3, 4]]

Layer 2 weights: W₂ = [[0.5, -1], [1, 0.5]]

Compute W_eff = W₂ · W₁:

Row 0, Col 0: 0.5 × 1 + (-1) × 3 = 0.5 - 3.0 = -2.5
Row 0, Col 1: 0.5 × 2 + (-1) × 4 = 1.0 - 4.0 = -3.0
Row 1, Col 0: 1 × 1 + 0.5 × 3 = 1.0 + 1.5 = 2.5
Row 1, Col 1: 1 × 2 + 0.5 × 4 = 2.0 + 2.0 = 4.0

So W_eff = [[-2.5, -3.0], [2.5, 4.0]]. Feed in x = [1, 0]:

Two-layer path: Layer 1 → [1, 3]. Layer 2 → [0.5 × 1 + (-1) × 3, 1 × 1 + 0.5 × 3] = [-2.5, 2.5].
Single-matrix path: W_eff · [1, 0] = [-2.5, 2.5].

Identical. Two layers, one layer — same output. Depth bought us nothing. Now imagine stacking 10, 50, or 100 of these. It's still one matrix. The network has no more expressive power than a single layer, regardless of depth.
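The hand calculation above can be double-checked numerically. The following is a minimal NumPy sketch that uses the same W₁, W₂, and x; the variable names are ours.

```python
import numpy as np

# Same matrices as the hand calculation (no bias, no activation)
W1 = np.array([[1.0, 2.0], [3.0, 4.0]])
W2 = np.array([[0.5, -1.0], [1.0, 0.5]])
x = np.array([1.0, 0.0])

W_eff = W2 @ W1              # collapse the two layers into one matrix
two_layer = W2 @ (W1 @ x)    # run the layers one after the other
one_layer = W_eff @ x        # run the collapsed matrix once

print("W_eff:\n", W_eff)
print("two-layer path:", two_layer)
print("single-matrix path:", one_layer)
print("paths agree:", np.allclose(two_layer, one_layer))
```

Swapping in any other 2×2 matrices or input vector should leave the final check unchanged, since the identity W₂(W₁x) = (W₂W₁)x holds for all of them.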

What Nonlinearity Buys You

Insert a nonlinear function σ between layers:

y₂ = W₂ · σ(W₁ · x + b₁) + b₂

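Now you can't factor out a single W_eff. The σ breaks the linearity. Each additional layer adds a new "fold" or "bend" to the function the network can represent. With enough hidden units, even a network with a single nonlinear hidden layer can approximate any continuous function on a bounded input region as closely as you like: that is the universal approximation theorem. Extra depth often makes such approximations much cheaper to represent.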
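One way to see that σ really blocks the collapse is to test the defining property of a linear map, f(a + b) = f(a) + f(b). The sketch below reuses W₁ and W₂ from the hand calculation and picks ReLU as σ; the test vectors a and b are arbitrary choices of ours.

```python
import numpy as np

W1 = np.array([[1.0, 2.0], [3.0, 4.0]])
W2 = np.array([[0.5, -1.0], [1.0, 0.5]])

def linear_net(x):
    # Two stacked linear layers, no activation
    return W2 @ (W1 @ x)

def relu_net(x):
    # Same layers with ReLU inserted between them
    return W2 @ np.maximum(0.0, W1 @ x)

a = np.array([1.0, 0.0])
b = np.array([-1.0, 0.5])

# A linear map must satisfy f(a + b) == f(a) + f(b)
linear_ok = np.allclose(linear_net(a + b), linear_net(a) + linear_net(b))
relu_ok = np.allclose(relu_net(a + b), relu_net(a) + relu_net(b))

print("linear net is additive:", linear_ok)   # True
print("ReLU net is additive:  ", relu_ok)     # False
```

Because the ReLU network fails the additivity test, no single matrix W_eff can reproduce it.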

The simulation below shows this dramatically. On the left, a linear network tries to classify points in two concentric circles. It can only draw a straight line. On the right, add ReLU between layers, and the decision boundary curves to fit the data.

Linear vs. Nonlinear Decision Boundaries

Left: linear network (no activations). Right: same architecture with ReLU. Toggle nonlinearity and adjust depth.

The difference is stark. With linearity, adding layers from 1 to 8 changes nothing — the boundary stays straight. With ReLU, more layers mean more expressive power: the boundary wraps tighter around the data.

python
import torch
import torch.nn as nn

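# 10-layer linear network: no activation between layers
linear_net = nn.Sequential(*[nn.Linear(2, 2) for _ in range(10)])

# Collapse all layers into one effective matrix and one effective bias
W_eff = torch.eye(2)
b_eff = torch.zeros(2)
with torch.no_grad():
    for layer in linear_net:
        b_eff = layer.weight @ b_eff + layer.bias  # push the running bias through this layer
        W_eff = layer.weight @ W_eff

    # The collapsed layer reproduces the 10-layer output
    x = torch.randn(5, 2)
    print(torch.allclose(linear_net(x), x @ W_eff.T + b_eff, atol=1e-5))  # True
# W_eff is a SINGLE 2×2 matrix: 10 layers = 1 layer
print("10 layers collapsed to:", W_eff.shape)  # torch.Size([2, 2])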

You might think "any nonlinearity will do." It won't. Sigmoid and tanh saturate for large inputs, crushing gradients to near-zero. ReLU can leave neurons permanently stuck at zero once their inputs turn negative. The choice of activation function directly determines whether your gradients flow or die. The next chapters walk through each one: its strengths, its failure mode, and when to use it.

What happens if you stack 100 layers with no activation function between them?

The network becomes 100× more expressive than a single layer
The network can approximate any function due to depth
The network is mathematically equivalent to a single linear layer
The gradients vanish, but the forward pass still benefits from depth

Chapter 1: Sigmoid & Tanh — The Classics

The first activation functions were inspired by biology. A neuron either fires or it doesn't, and sigmoid is a smoothed version of that step from 0 to 1. Its centered cousin, tanh, maps to (-1, 1). Both were the backbone of neural networks from the 1980s to ~2012. Both have a fatal flaw.

To understand that flaw, we need to derive these functions from scratch, compute their gradients by hand, and watch what happens when you chain those gradients through a deep network.

Deriving Sigmoid

We want a smooth function that maps any real number to the range (0, 1). It should be near 0 for very negative inputs, near 1 for very positive inputs, and transition smoothly around zero. The logistic function does exactly this:

σ(x) = 1 / (1 + e^-x)

Why this particular form? The exponential e^-x is huge when x is very negative (pushing the denominator up, making σ near 0) and tiny when x is very positive (denominator ≈ 1, making σ near 1). The transition happens around x = 0, where e⁰ = 1, so σ(0) = 1/(1+1) = 0.5.

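The derivative has an elegant form. Starting from σ(x) = (1 + e^-x)^-1 and applying the chain rule:

σ'(x) = e^-x / (1 + e^-x)² = [1 / (1 + e^-x)] · [e^-x / (1 + e^-x)] = σ(x) · (1 - σ(x))

The last step uses 1 - σ(x) = e^-x / (1 + e^-x).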

This is beautiful but dangerous. The derivative is the product of σ(x) and its complement. The maximum occurs when both factors are equal — at x = 0, where σ(0) = 0.5, giving σ'(0) = 0.5 × 0.5 = 0.25.

Read that again: the maximum possible gradient of sigmoid is 0.25. Not 1. Not 0.5. A quarter. And it only gets worse from there.
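A quick numerical check can confirm both claims: that σ'(x) = σ(x)(1 - σ(x)), and that the peak is 0.25 at x = 0. This is a minimal sketch using a central finite difference; the grid and the step size h are arbitrary choices of ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.linspace(-6.0, 6.0, 1201)   # grid with step 0.01, includes x = 0
h = 1e-5

# Central finite difference as an independent estimate of the derivative
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2 * h)

max_err = np.max(np.abs(numeric - sigmoid_grad(xs)))
peak_x = xs[np.argmax(sigmoid_grad(xs))]
peak = sigmoid_grad(xs).max()

print(f"max |numeric - analytic| = {max_err:.1e}")
print(f"largest gradient {peak:.4f} at x = {peak_x:.2f}")
```

The finite-difference estimate and the closed form should agree to many decimal places, and no grid point should exceed a gradient of 0.25.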

Hand Calculation: Sigmoid Values and Gradients

Trace sigmoid across five input values. We compute both the output and the gradient at each point to see exactly where gradients die.

x	e^-x	σ(x)	σ'(x) = σ(1-σ)	Status
-3	20.09	1/(1+20.09) = 0.047	0.047 × 0.953 = 0.045	Saturated low
-1	2.718	1/(1+2.718) = 0.269	0.269 × 0.731 = 0.197	Weak gradient
0	1.000	1/(1+1) = 0.500	0.500 × 0.500 = 0.250	Maximum gradient
1	0.368	1/(1+0.368) = 0.731	0.731 × 0.269 = 0.197	Weak gradient
3	0.050	1/(1+0.050) = 0.953	0.953 × 0.047 = 0.045	Saturated high

At x = ±3, the gradient is already 5.5× smaller than the maximum. At x = 5: σ(5) = 0.99331, σ'(5) = 0.99331 × 0.00669 ≈ 0.0066. That's about 38× smaller than the maximum.

Deriving Tanh

Sigmoid outputs are always positive (between 0 and 1). This means the next layer always receives positive inputs, which can cause zig-zagging during gradient descent. Tanh fixes this by centering the output around zero:

tanh(x) = (e^x - e^-x) / (e^x + e^-x)

The relationship between tanh and sigmoid is direct: tanh(x) = 2σ(2x) - 1. Tanh maps to (-1, 1) instead of (0, 1). Same S-shape, but centered at zero.

The derivative:

tanh'(x) = 1 - tanh²(x)

At x = 0: tanh(0) = 0, so tanh'(0) = 1 - 0 = 1.0. That's 4× better than sigmoid's 0.25 at the peak. But tanh still saturates: at x = 3, tanh(3) ≈ 0.995, giving tanh'(3) = 1 - 0.995² ≈ 0.010. At x = 5, tanh'(5) ≈ 0.00018. The gradient is essentially dead.
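Both the identity tanh(x) = 2σ(2x) - 1 and the saturation of tanh'(x) are easy to confirm numerically. The sketch below is a small NumPy check; the sample points are our own choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

xs = np.array([0.0, 1.0, 3.0, 5.0])

# Identity from the text: tanh(x) = 2 * sigmoid(2x) - 1
identity_gap = np.max(np.abs(np.tanh(xs) - (2.0 * sigmoid(2.0 * xs) - 1.0)))
print(f"largest gap between tanh(x) and 2*sigmoid(2x) - 1: {identity_gap:.1e}")

# Gradient shrinks quickly as |x| grows
for x, g in zip(xs, tanh_grad(xs)):
    print(f"x = {x:.0f}   tanh'(x) = {g:.5f}")
```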

The Vanishing Gradient Chain

During backpropagation, the gradient at each layer is multiplied by the local activation gradient. In a 10-layer network with sigmoid activations, the gradient arriving at layer 1 is the product of 10 sigmoid derivatives.

Even at the maximum (0.25 per layer):

0.25¹⁰ = 0.25 × 0.25 × ... × 0.25 = 9.5 × 10^-7

That's less than one millionth of the original gradient. Layer 1's weights essentially stop learning. This is the vanishing gradient problem, and it's why networks deeper than ~5 layers were nearly impossible to train before 2012.

Layers	Sigmoid (0.25ⁿ)	Tanh best (1.0ⁿ)	Tanh typical (0.6ⁿ)
3	0.0156	1.0	0.216
5	0.00098	1.0	0.0778
10	9.5 × 10^-7	1.0	0.0060
20	9.1 × 10^-13	1.0	3.7 × 10^-5

The "tanh best" column is the theoretical maximum — all inputs at exactly zero. In practice, inputs wander away from zero, and tanh gradients shrink to ~0.6 or less. The "typical" column shows the realistic picture. Tanh is better than sigmoid, but still vanishes.

Sigmoid & Tanh: Curves and Gradient Chains

Toggle between sigmoid and tanh. Drag the input slider to see the gradient at each point. Below: watch the gradient chain shrink as layers increase.

python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)  # max = 0.25 at x=0

def tanh_grad(x):
    t = np.tanh(x)
    return 1 - t**2    # max = 1.0 at x=0

# Chain of 10 sigmoid gradients at x=0 (best case)
chain_10 = sigmoid_grad(0) ** 10
print(f"10-layer sigmoid chain: {chain_10:.2e}")  # 9.54e-07

# Chain of 10 tanh gradients at x=1 (realistic): 0.42^10
chain_10_tanh = tanh_grad(1) ** 10
print(f"10-layer tanh chain (x=1): {chain_10_tanh:.2e}")  # 1.71e-04
# (the table's "typical" column rounds to 0.6 per layer: 0.6**10 = 0.0060)

Sigmoid isn't "bad." It's still the right choice for binary classification output layers (where you WANT a probability in [0,1]) and for gates in LSTMs and GRUs (where the gate needs to smoothly interpolate between "fully open" and "fully closed"). The problem is using it as a hidden-layer activation in deep feedforward networks. Context matters — an activation function isn't universally good or bad; it's about where you put it.

Why does the sigmoid gradient vanish for large inputs?

Sigmoid outputs become negative, canceling the gradient
The max sigmoid derivative is 0.25, and for large |x| it approaches 0, so multiplying many small gradients gives near-zero
The sigmoid function overflows to infinity
Sigmoid is not differentiable at large values

Chapter 2: ReLU — The Revolution

In 2012, a simple idea went mainstream and changed everything. What if the activation function was just... a ramp? No exponentials, no saturation, no squashing. For positive inputs, pass them through unchanged. For negative inputs, output zero. That's ReLU, the Rectified Linear Unit, and it helped make deep learning practical.

ReLU(x) = max(0, x)

That's the entire definition. One line. No parameters, no exponentials, no division. Just a comparison and a zero. It's the simplest nonlinear function you can imagine, and it largely removed the vanishing gradient problem that saturating activations had caused for decades.

Why ReLU Works

The gradient of ReLU is:

ReLU'(x) = 1 if x > 0, 0 if x ≤ 0

For positive inputs, the gradient is exactly 1. Not 0.25 like sigmoid. Not 0.6 like typical tanh. Exactly 1. This means gradients flow through active ReLU neurons without any shrinkage.

In a 100-layer network where all neurons are active (positive inputs), the gradient chain is:

1 × 1 × 1 × ... × 1 = 1¹⁰⁰ = 1

Compare with sigmoid:

0.25 × 0.25 × ... × 0.25 = 0.25¹⁰⁰ ≈ 6 × 10^-61

That's not a typo. 6 × 10^-61. ReLU gives you a gradient of 1; sigmoid gives you a factor with sixty zeros after the decimal point. Go a little deeper, to around 133 layers, and the factor drops below 10^-80, roughly one over the number of atoms in the observable universe. This is one reason AlexNet (2012), the model that launched the deep learning era, used ReLU, and most architectures after it followed suit.

Hand Calculation: ReLU vs. Sigmoid Gradients

Compare gradient flow through 10 layers. We trace the gradient for both activations at a realistic input value.

x	ReLU(x)	ReLU'(x)	σ(x)	σ'(x)
-2	0	0	0.119	0.105
-1	0	0	0.269	0.197
0	0	0 or 1	0.500	0.250
0.5	0.5	1	0.622	0.235
1	1	1	0.731	0.197
3	3	1	0.953	0.045

Now chain 10 of these gradients together. Assume all neurons are active (x > 0):

ReLU: 1¹⁰ = 1.0. Perfect gradient flow.
Sigmoid (best case, x=0): 0.25¹⁰ = 9.5 × 10^-7.
Sigmoid (x=1): 0.1966¹⁰ ≈ 8.6 × 10^-8.

Six to seven orders of magnitude difference. This is why ReLU enabled deep networks. Gradients that actually reach the early layers mean early layers actually learn.

The Dead Neuron Problem

ReLU's gradient for x ≤ 0 is exactly zero. If a neuron's pre-activation becomes negative for every input in the training set (say, after a large negative bias shift or an unlucky weight update), it outputs zero everywhere, its gradient is zero everywhere, and its weights stop updating. With no gradient to pull it back, the neuron stays dead.

This isn't rare. Depending on learning rate and initialization, a sizeable fraction of the neurons in a ReLU network (figures of 10-40% are commonly quoted) can die during training. The risk factors:

High learning rate: Large weight updates can push a neuron's bias strongly negative, killing it in one step.
Poor initialization: If initial biases are strongly negative, neurons can start dead and never recover.
Unlucky data: A batch that produces an unusually large gradient can knock the weights into a region where the pre-activation is negative for every input.

Think of it this way: sigmoid neurons get "sleepy" (vanishing gradients slow learning), but ReLU neurons can "die" (zero gradient means zero learning, forever). Sleepy neurons can eventually wake up if the gradient signal is strong enough. Dead neurons cannot.
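The "zero gradient means zero learning" argument can be sketched for a single neuron. The example below is a toy setup of our own (random inputs, random regression targets, a bias of -10 standing in for one bad update), not a real training run.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))   # toy inputs (assumed for illustration)
y = rng.normal(size=200)        # toy regression targets

w = np.array([0.5, -0.3])
b = -10.0                       # bias pushed far negative, e.g. by one bad update

z = X @ w + b                   # pre-activations: negative for every sample
a = np.maximum(0.0, z)          # ReLU output: all zeros

# Backpropagate a mean-squared-error loss through the ReLU
dL_da = 2.0 * (a - y) / len(y)
dL_dz = dL_da * (z > 0)         # the ReLU gradient masks out every sample
grad_w = X.T @ dL_dz
grad_b = dL_dz.sum()

print("samples with z > 0:", int((z > 0).sum()))
print("grad_w:", grad_w, " grad_b:", grad_b)
```

With no active samples, both gradients are exactly zero, so a gradient step of any size leaves w and b where they are.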

ReLU vs. Sigmoid: Network Health

Top: ReLU and sigmoid curves with gradient overlay. Bottom: a grid of 64 neurons trained step-by-step. With ReLU, watch neurons die (go dark). With sigmoid, all survive but gradients fade. Adjust learning rate to see the effect.

python
import numpy as np

def relu(x):
    return np.maximum(0, x)  # one line: the simplest activation

def relu_grad(x):
    # np.asarray lets this accept plain Python floats as well as arrays
    return (np.asarray(x) > 0).astype(float)  # 1 if positive, 0 if not

# Compare gradient chains
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = 1.0
relu_chain_10 = relu_grad(x) ** 10
sig_chain_10 = (sigmoid(x) * (1 - sigmoid(x))) ** 10

print(f"ReLU  10-layer chain: {relu_chain_10:.1f}")   # 1.0
print(f"Sigmoid 10-layer chain: {sig_chain_10:.2e}")  # 8.63e-08

ReLU isn't differentiable at x = 0. In practice, this doesn't matter. We just pick either 0 or 1 as the gradient at x = 0 (convention: 0). Neural networks are trained with stochastic gradient descent on mini-batches — the probability of any single input landing at exactly 0.000... is essentially zero. The mathematical non-differentiability at a single point has zero practical impact. Don't let a calculus technicality scare you away from the most important activation function in deep learning history.

What is the "dying ReLU" problem?

ReLU outputs become infinitely large for positive inputs
When a neuron's input is permanently negative, ReLU outputs 0 with gradient 0, so the neuron's weights never update and it's effectively dead
ReLU is not differentiable, causing numerical errors
ReLU squashes gradients to 0.25 like sigmoid

Chapter 3: Leaky ReLU & ELU — Fixing Dead Neurons

Dead neurons waste capacity. If 30% of your network's neurons are dead, you're paying for the full network but using only 70% of it. Compute, memory, parameters: all wasted on neurons that output zero forever and will never learn again.

The fix is elegantly simple: instead of outputting exactly zero for negative inputs, let a small signal through. A tiny leak. That's Leaky ReLU.

Deriving Leaky ReLU

LeakyReLU(x) = x if x > 0, αx if x ≤ 0

Where α is a small positive constant, typically 0.01. For positive inputs, it behaves exactly like ReLU — gradient of 1, no saturation. For negative inputs, instead of a flat zero, the output is a gently sloping line with slope α.

The gradient:

LeakyReLU'(x) = 1 if x > 0, α if x ≤ 0

For negative inputs, the gradient is α = 0.01. Tiny, but nonzero. Dead neurons become "drowsy" neurons — they can still receive gradient signal and eventually wake up. A neuron that got pushed into negative territory by a bad weight update can slowly recover, because 0.01 of the gradient still flows through.

Deriving ELU

Exponential Linear Unit (ELU) takes a different approach to the negative region. Instead of a straight line, it uses an exponential curve:

ELU(x) = x if x > 0, α(e^x - 1) if x ≤ 0

For large negative x, e^x approaches 0, so ELU approaches -α. The function has a smooth asymptote at -α for the negative side, while being identical to ReLU for positive inputs.

The gradient:

ELU'(x) = 1 if x > 0, α · e^x if x ≤ 0

Key difference from Leaky ReLU: the ELU gradient for very negative inputs approaches 0 (not α). At x = -1, the gradient is α · e^-1 ≈ 0.368α. At x = -5, it's α · e^-5 ≈ 0.0067α. This provides soft saturation — extremely negative inputs get suppressed, acting as a noise filter, while moderately negative inputs still pass gradient.

Hand Calculation: Comparing the Variants

Compute all three activations and gradients. Leaky ReLU uses α = 0.01, ELU uses α = 1.0. We trace five input values.

x	ReLU	ReLU'	Leaky	Leaky'	ELU	ELU'
-3	0	0	-0.03	0.01	-0.950	0.050
-1	0	0	-0.01	0.01	-0.632	0.368
0	0	0	0	0.01	0	1
1	1	1	1	1	1	1
3	3	1	3	1	3	1

Let's verify the ELU values at x = -3 step by step:

e^-3 = 0.04979
ELU(-3) = 1.0 × (0.04979 - 1) = -0.950
ELU'(-3) = 1.0 × e^-3 = 0.050

Compare that gradient of 0.050 with ReLU's 0. The ELU neuron at x = -3 is still learning — slowly, but it hasn't died. And at x = -1, ELU's gradient is 0.368 — quite healthy. Leaky ReLU's gradient at x = -1 is a constant 0.01 regardless of how negative the input gets. ELU gives stronger gradients near zero and weaker ones far away.
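The gradient columns of the comparison can be reproduced in a few lines. This is a minimal sketch that follows the piecewise definitions given earlier (negative branch for x ≤ 0), with α = 0.01 for Leaky ReLU and α = 1.0 for ELU; the function names are ours.

```python
import numpy as np

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

def elu_grad(x, alpha=1.0):
    return np.where(x > 0, 1.0, alpha * np.exp(x))

xs = np.array([-5.0, -3.0, -1.0, 1.0, 3.0])
for x, r, l, e in zip(xs, relu_grad(xs), leaky_relu_grad(xs), elu_grad(xs)):
    print(f"x = {x:+.0f}   ReLU' = {r:.3f}   Leaky' = {l:.3f}   ELU' = {e:.3f}")
```

The extra row at x = -5 shows the crossover: far enough into the negative region, ELU's gradient falls below Leaky ReLU's constant 0.01.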

PReLU — Learning the Slope

Parametric ReLU (PReLU), proposed by He et al. (2015), is Leaky ReLU where α is a learnable parameter. Instead of fixing α = 0.01, the network decides how much negative signal to let through during training.

PReLU(x) = x if x > 0, αx if x ≤ 0 (α learned via backprop)

In their ImageNet experiments, He et al. found that the learned α values varied by layer. Early layers learned larger α (more negative signal preserved), while later layers learned smaller α. The network was automatically tuning the activation function per layer — something hand-tuning could never achieve efficiently.

PReLU added only one parameter per channel (or per layer), so the overhead is negligible. It improved ImageNet top-5 error by ~0.5% over standard ReLU — small in absolute terms, but significant at the frontier of accuracy at the time.

Activation Function Comparison

All three functions on the same axes. Adjust α to see how the negative region changes. Below: gradient heatmaps showing gradient strength across the input range. Further below: the dead neuron counter from Ch2 — compare ReLU vs. Leaky ReLU vs. ELU.

python
import numpy as np
import torch
import torch.nn as nn

# From scratch
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

# PyTorch equivalents
leaky = nn.LeakyReLU(negative_slope=0.01)
elu_layer = nn.ELU(alpha=1.0)
prelu = nn.PReLU(num_parameters=1)  # alpha is learnable!

x = torch.linspace(-3, 3, 7)
print("Leaky:", leaky(x).data)
print("ELU:  ", elu_layer(x).data)
print("PReLU:", prelu(x).data)
print("PReLU alpha:", prelu.weight.item())  # initial: 0.25

You might think Leaky ReLU is always better than ReLU since it fixes dead neurons. In practice, the difference is often negligible for well-initialized networks with reasonable learning rates. ReLU's simplicity — one comparison, no multiply for the positive branch — gives it a slight speed advantage on GPUs. The dead neuron problem is real but often overstated. Most networks work fine with regular ReLU unless learning rates are very high or initialization is poor. Leaky ReLU and ELU are your fallback when you do see dying neurons in your training logs — not the default first choice.

How does Leaky ReLU prevent dead neurons?

It clips negative inputs to a minimum value of -1
It assigns a small nonzero gradient (α, typically 0.01) to negative inputs, so weights can still update
It randomly resets dead neurons during training
It uses batch normalization to keep inputs positive

Chapter 4: GELU — The Transformer's Choice

ReLU makes a binary decision: positive inputs pass, negative inputs die. But what if instead of a hard gate, we used a soft one? What if the probability of passing an input through depended on how large it is?

Inputs of +3 almost certainly pass. Inputs of -3 almost certainly get zeroed. Inputs near 0 get a coin flip. That probabilistic interpretation is GELU — the Gaussian Error Linear Unit.

Proposed by Hendrycks and Gimpel in 2016, GELU became the default activation for BERT, GPT-2, GPT-3, and Vision Transformer (ViT). If you use a BERT- or GPT-style transformer, you're almost certainly using GELU or a close relative somewhere inside it.

The Stochastic Regularization Interpretation

Here's where GELU comes from. Imagine you have an input x to a neuron. Instead of passing it through directly, you multiply it by a random mask — a Bernoulli random variable that's either 0 or 1. If the mask is 1, the input passes. If 0, it's dropped. Sound familiar? That's dropout.

But dropout uses a fixed probability (like 0.1). What if the dropout probability depended on the value of x itself? Specifically: the probability that the input passes is Φ(x), the standard normal CDF — the probability that a draw from a standard normal distribution is less than or equal to x.

Large positive x? Almost all of the normal distribution is below you, so Φ(x) ≈ 1 — you pass. Large negative x? Almost none is below you, so Φ(x) ≈ 0 — you're dropped. Near zero? Φ(0) = 0.5 — a coin flip.

The expected value of this stochastic process IS GELU. For input x, the mask is Bernoulli(Φ(x)). The expected output is: E[x · Bernoulli(Φ(x))] = x · Φ(x). That's the entire GELU formula. No curve fitting, no heuristics — just the expected value of a probabilistic gate.
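The expected-value claim can be checked by simulation: draw many Bernoulli(Φ(x)) masks, apply them to x, and average. The sketch below is a minimal Monte Carlo check; the sample count, seed, and test inputs are arbitrary choices of ours.

```python
import math
import numpy as np

def phi(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = np.random.default_rng(0)
n_samples = 200_000

for x in (-1.0, 0.5, 2.0):
    mask = rng.random(n_samples) < phi(x)   # Bernoulli(phi(x)) gate, one per sample
    simulated = (x * mask).mean()           # average of the gated outputs
    print(f"x = {x:+.1f}   simulated = {simulated:+.4f}   x*Phi(x) = {x * phi(x):+.4f}")
```

The simulated averages should land close to x · Φ(x), with the gap shrinking as n_samples grows.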

The Formula

GELU(x) = x · Φ(x) = x · 0.5 · (1 + erf(x / √2))

Where erf is the Gauss error function. Since erf can be expensive to compute, there's a fast approximation used in practice:

GELU(x) ≈ 0.5x · (1 + tanh(√(2/π) · (x + 0.044715x³)))

The tanh approximation is what frameworks like PyTorch use when you pass approximate='tanh'. The exact version uses erf directly. Both produce nearly identical results: the absolute difference stays below 0.001 for all inputs.

Hand Calculation: GELU at Five Points

Let's compute GELU by hand at five inputs. We need Φ(x) — the standard normal CDF — which you can look up in a Z-table or compute from erf.

x = -2:

Φ(-2) = 0.0228 (only 2.28% of the normal distribution is below -2)
GELU(-2) = -2 × 0.0228 = -0.0456
ReLU(-2) = 0. GELU lets a tiny negative signal through.

x = -1:

Φ(-1) = 0.1587 (about 16% passes)
GELU(-1) = -1 × 0.1587 = -0.1587
ReLU(-1) = 0. GELU passes a noticeable negative value.

x = 0:

Φ(0) = 0.5 (half the distribution is below zero)
GELU(0) = 0 × 0.5 = 0
Same as ReLU. Both zero at the origin.

x = 1:

Φ(1) = 0.8413 (about 84% passes)
GELU(1) = 1 × 0.8413 = 0.8413
ReLU(1) = 1. GELU is more conservative — it attenuates slightly.

x = 2:

Φ(2) = 0.9772 (about 98% passes)
GELU(2) = 2 × 0.9772 = 1.9544
ReLU(2) = 2. GELU almost matches — the gate is nearly fully open.

Pattern: For positive inputs, GELU < ReLU (the gate isn't fully open). For negative inputs, GELU ≠ 0 (the gate isn't fully closed). GELU interpolates smoothly between "pass" and "suppress" — there's no sharp corner at zero.
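Instead of a Z-table, the five values can be computed with the standard library's error function. This is a small sketch; the helper names phi and gelu are ours.

```python
import math

def phi(x):
    """Standard normal CDF via erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    return x * phi(x)

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x = {x:+.0f}   Phi = {phi(x):.4f}   GELU = {gelu(x):+.4f}   ReLU = {max(0.0, x):.1f}")
```

Small differences in the last digit relative to the hand calculation come from rounding Φ(x) to four decimals before multiplying.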

Why GELU Won for Transformers

Three properties made GELU the transformer default:

1. Smooth everywhere. GELU has continuous derivatives of all orders. ReLU has a discontinuous first derivative at x=0. This smoothness means gradients never have sudden jumps, which helps optimizers like Adam maintain stable momentum estimates.

2. Non-monotonic near zero. GELU dips below zero and reaches a minimum near x ≈ -0.75, where GELU(x) ≈ -0.17. The function isn't strictly increasing: moving left from zero it falls to that minimum, then rises back toward zero as x becomes more negative. This non-monotonicity is often interpreted as a mild form of built-in regularization.

3. Stochastic regularization. Because GELU can be interpreted as the expected value of an input-dependent dropout, it can be read as building a regularization effect into the forward pass. Empirically, Hendrycks and Gimpel report improvements over ReLU and ELU across the vision, language, and speech tasks they tested.

Interactive: GELU vs ReLU

GELU Explorer

Drag the slider to move a probe along the x-axis. Top: GELU vs ReLU curves. Middle: the gate probability Φ(x). Bottom: gradients compared.

Code: GELU from Scratch

python
import torch
import math

# Exact GELU using the error function
def gelu_exact(x):
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

# Fast tanh approximation (PyTorch: F.gelu(x, approximate='tanh'); the default is exact erf)
def gelu_tanh(x):
    return 0.5 * x * (1.0 + torch.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    ))

# Verify against PyTorch built-in
x = torch.linspace(-3, 3, 7)
print("Exact:  ", gelu_exact(x))
print("Approx: ", gelu_tanh(x))
print("PyTorch:", torch.nn.functional.gelu(x))
# Max difference between exact and approx: well under 1e-3

GELU is NOT just "smooth ReLU." It's non-monotonic: it dips below zero and bottoms out near x ≈ -0.75, where GELU(x) ≈ -0.17. ReLU is always non-negative. Because the gate Φ(x) never fully closes, GELU outputs small negative values for negative inputs, which gives a richer gradient signal than ReLU's flat zero. The dip isn't a bug; it follows directly from the soft gate.

What gives GELU its "soft gating" behavior?

It clips negative values to a small constant like -0.01 (Leaky ReLU style)
It uses a learnable parameter to control the sharpness of the gate
It multiplies the input by Φ(x), the probability that a standard normal is ≤ x, so larger inputs are more likely to pass
It applies a sigmoid function after ReLU to smooth the output

Chapter 5: SiLU/Swish — The Self-Gated Activation

In 2017, Google Brain tried something unusual. Instead of designing an activation function by hand, they used neural architecture search — an AI designing AI components. They searched over a space of simple mathematical operations and evaluated each candidate on real tasks.

The winner was remarkably simple: multiply the input by its own sigmoid. They called it Swish.

Swish(x) = x · σ(x) = x / (1 + e^-x)

That's it. The sigmoid σ(x) acts as a gate: for large positive x, σ(x) ≈ 1 so Swish(x) ≈ x (identity). For large negative x, σ(x) ≈ 0 so Swish(x) ≈ 0 (suppression). Near zero, the sigmoid gives a smooth interpolation.

The β Parameter

The full Swish formula includes a learnable (or fixed) parameter β:

Swish_β(x) = x · σ(βx)

This parameter controls the sharpness of the gate:

β = 0: σ(0) = 0.5 for all x, so Swish(x) = x/2. A straight line — purely linear.
β = 1: Standard Swish. The sigmoid gate varies smoothly with x. This is the most common setting.
β → ∞: σ(βx) approaches a step function — 0 for x < 0, 1 for x > 0. Swish converges to ReLU.

Swish interpolates between a linear function and ReLU. At β=0, it's linear (x/2). At β=∞, it's ReLU. At β=1, it's somewhere in between — smooth, non-monotonic, and just right for most tasks.

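The Gradient: No Dead Zone

The Swish gradient is more complex than ReLU's but has a critical advantage: it has no flat zero region. It crosses zero only at a single point, the minimum of the function near x ≈ -1.28, instead of being zero for every negative input.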

Swish'(x) = σ(x) + x · σ(x) · (1 - σ(x)) = σ(x) · (1 + x · (1 - σ(x)))

At x = 0: σ(0) = 0.5, so Swish'(0) = 0.5 · (1 + 0) = 0.5. Compare this to ReLU, whose gradient jumps from 0 to 1 at x=0. Swish has no discontinuity — a smooth transition through every input value.

Even for large negative x (like x = -10), σ(-10) ≈ 0.0000454 is tiny but nonzero, so Swish'(-10) ≈ -0.0004 rather than exactly 0. Swish neurons never fully die: there's always a small gradient signal to nudge them back to life.

Hand Calculation: Swish at Five Points

Using β = 1 (standard SiLU). We need σ(x) = 1/(1 + e^-x).

x = -2:

σ(-2) = 1/(1 + e²) = 1/8.389 = 0.119
Swish(-2) = -2 × 0.119 = -0.238

x = -1:

σ(-1) = 1/(1 + e¹) = 1/3.718 = 0.269
Swish(-1) = -1 × 0.269 = -0.269

x = 0:

σ(0) = 0.5
Swish(0) = 0 × 0.5 = 0

x = 1:

σ(1) = 1/(1 + e^-1) = 1/1.368 = 0.731
Swish(1) = 1 × 0.731 = 0.731

x = 2:

σ(2) = 1/(1 + e^-2) = 1/1.135 = 0.881
Swish(2) = 2 × 0.881 = 1.762

The minimum of Swish occurs at x ≈ -1.28, where Swish(x) ≈ -0.278. This is a deeper dip than GELU's minimum of -0.17. Both are non-monotonic, but Swish allows larger negative outputs.
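The two minima can be located with a simple grid search over the negative axis. This is a rough numerical sketch (the grid range and resolution are our choice), not an analytic derivation.

```python
import math
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))

def gelu(x):
    # math.erf works on scalars, so apply it element by element
    return np.array([0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])

xs = np.linspace(-4.0, 0.0, 40001)   # fine grid over the negative side

swish_vals = swish(xs)
gelu_vals = gelu(xs)
i_s = np.argmin(swish_vals)
i_g = np.argmin(gelu_vals)

swish_min_x, swish_min = xs[i_s], swish_vals[i_s]
gelu_min_x, gelu_min = xs[i_g], gelu_vals[i_g]

print(f"Swish minimum: {swish_min:.4f} at x = {swish_min_x:.3f}")
print(f"GELU  minimum: {gelu_min:.4f} at x = {gelu_min_x:.3f}")
```

At each minimum the slope is zero by definition, so these are the only inputs where the gradient of either function vanishes exactly.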

SiLU — The Same Thing, Different Name

SiLU (Sigmoid Linear Unit) is exactly Swish with β = 1. The name was proposed independently by Elfwing et al. in 2018. PyTorch uses torch.nn.SiLU(). You'll see both names — they're the same function.

Where SiLU/Swish appears in the wild:

Model	Where	Year
EfficientNet	All conv layers	2019
LLaMA / LLaMA 2	FFN gate (via SwiGLU)	2023
Mistral / Mixtral	FFN gate (via SwiGLU)	2023
Stable Diffusion	U-Net conv blocks	2022

Interactive: Swish with Adjustable β

Swish β Explorer

Slide β from 0 (linear) to 5 (near-ReLU). Watch the curve morph. Below: gradient comparison with ReLU and GELU.

Code: SiLU/Swish from Scratch

python
import torch

# SiLU / Swish (beta=1)
def silu(x):
    return x * torch.sigmoid(x)

# General Swish with adjustable beta
def swish(x, beta=1.0):
    return x * torch.sigmoid(beta * x)

# Gradient of SiLU (for understanding)
def silu_grad(x):
    s = torch.sigmoid(x)
    return s * (1 + x * (1 - s))

# Verify against PyTorch built-in
x = torch.linspace(-3, 3, 7)
print("Ours:   ", silu(x))
print("PyTorch:", torch.nn.functional.silu(x))
# Identical — SiLU IS Swish with beta=1

# Beta sweep: watch Swish morph from linear to ReLU
for beta in [0, 0.5, 1, 2, 5, 20]:
    y = swish(torch.tensor(1.0), beta)
    print(f"beta={beta:4}  Swish(1)={y:.4f}")
# beta=0 → 0.5, beta=1 → 0.731, beta=20 → 1.000 (≈ReLU)

Swish and GELU look very similar and have nearly identical performance. The practical difference is negligible for most tasks. GELU is standard in transformers such as BERT, GPT-2/3, and ViT, largely because it was adopted there first. SiLU/Swish is standard in vision models (EfficientNet) and shows up in LLM FFNs via SwiGLU. Don't agonize over the choice; either works.

What happens to Swish as β increases toward infinity?

It becomes a constant function (always outputs 0)
It approaches ReLU: the sigmoid gate becomes a hard step function
It becomes the identity function f(x) = x
It oscillates increasingly rapidly around zero

Chapter 6: Gated Linear Units — SwiGLU & GeGLU

Every transformer has two main components per layer: attention and a feed-forward network (FFN). We've been talking about what activation goes inside the FFN. But what if the FFN's entire structure changed?

What if instead of one path through an activation, you had two paths — one for content and one for deciding what to keep? That's the Gated Linear Unit (GLU).

The Standard FFN — One Path

In the original transformer (Vaswani et al., 2017), each layer's FFN is:

Input x

[batch, seq, d_model]

↓

W₁ · x + b₁

Project up: d_model → d_ff (usually 4× d_model)

↓

ReLU / GELU

Apply activation element-wise

↓

W₂ · h + b₂

Project down: d_ff → d_model

↓

Output

[batch, seq, d_model]

Two weight matrices. One activation. One path. Simple. The activation (ReLU or GELU) decides which features to suppress. But it makes that decision based only on the magnitude of each value independently.

The GLU — Two Paths

A Gated Linear Unit splits the FFN into two parallel projections:

Input x

[batch, seq, d_model]

↓ (split into two paths)

W_gate · x

Gate path: d_model → d_ff

W_up · x

Content path: d_model → d_ff

↓ apply activation to gate path

σ(W_gate · x) ⊙ (W_up · x)

Element-wise multiply: gate controls what content passes

↓

W_down · h

Project down: d_ff → d_model

↓

Output

[batch, seq, d_model]

The key insight: the gate path and the content path are different linear projections of the same input. The gate learns which features to keep. The content learns what values to produce. The element-wise product lets the network learn feature selection — a much richer operation than applying an activation function element-wise.

Think of it like an audio mixing board. The content path produces all the audio tracks. The gate path is a row of faders — each one independently controlling the volume of one track. The activation function on the gate (SiLU, GELU, sigmoid) determines how the faders behave. Without gating, each track can only be "on" or "off" based on its own volume. With gating, a quiet track can be amplified and a loud track can be muted — based on the full context of the input.

GLU Variants

The activation function on the gate path defines the GLU variant:

Variant	Gate Activation	Formula	Used In
GLU	σ(x) (sigmoid)	σ(xW_g) ⊙ xW_up	Original (Dauphin 2017)
SwiGLU	SiLU/Swish	SiLU(xW_g) ⊙ xW_up	LLaMA, Mistral, PaLM
GeGLU	GELU	GELU(xW_g) ⊙ xW_up	Gemma, some T5 variants
ReGLU	ReLU	ReLU(xW_g) ⊙ xW_up	Experimental

SwiGLU (Shazeer, 2020) emerged as the practical winner. In Shazeer's language modeling experiments, the gated variants beat the non-gated FFNs, with GeGLU and SwiGLU producing the best perplexities. Google adopted SwiGLU for PaLM. Meta adopted it for LLaMA. Most major open-weight LLMs now use SwiGLU or a close relative such as GeGLU (Gemma).
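Because the variants in the table differ only in the gate activation, they can be sketched as one function with a swappable activation. The dictionary name `GATE_ACTIVATIONS`, the helper `gated_hidden`, and the sample pre-activation values are hypothetical, chosen only to make the comparison concrete.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def silu(z):
    return z * sigmoid(z)

def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def relu(z):
    return max(0.0, z)

# One entry per row of the variants table
GATE_ACTIVATIONS = {"GLU": sigmoid, "SwiGLU": silu, "GeGLU": gelu, "ReGLU": relu}

def gated_hidden(gate_pre, content, variant):
    """act(gate_pre) ⊙ content for the chosen GLU variant."""
    act = GATE_ACTIVATIONS[variant]
    return [act(g) * c for g, c in zip(gate_pre, content)]

# Hypothetical projections of one input (W_gate · x and W_up · x)
gate_pre = [0.8, -1.2]
content = [1.1, 0.7]

out = {name: gated_hidden(gate_pre, content, name) for name in GATE_ACTIVATIONS}
for name, hidden in out.items():
    print(name, [round(h, 3) for h in hidden])
```

Only ReGLU zeroes the second channel outright; the other three let a scaled version of the content through.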

The 2/3 Rule — Free Gating

You might worry that GLU adds parameters. After all, it has three weight matrices (W_gate, W_up, W_down) instead of two (W₁, W₂). Let's count.

Standard FFN:

W₁: d_model × d_ff parameters
W₂: d_ff × d_model parameters
Total: 2 × d_model × d_ff

GLU FFN with hidden dimension d':

W_gate: d_model × d' parameters
W_up: d_model × d' parameters
W_down: d' × d_model parameters
Total: 3 × d_model × d'

To match parameter counts: 3 × d_model × d' = 2 × d_model × d_ff. Solve: d' = 2/3 × d_ff.

The 2/3 rule: Set the GLU hidden dimension to 2/3 of the standard FFN hidden dimension. You get the gating mechanism at no extra cost: same total parameter count, better expressiveness. In practice, LLaMA uses a hidden dimension of (2/3) × 4 × d_model ≈ 2.67 × d_model, rounded up to a multiple of 256.
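The arithmetic behind the 2/3 rule can be checked with a few lines of plain Python. This is a minimal sketch: the function names and the `expansion` and `multiple_of` parameters are assumptions made for illustration, and biases are ignored.

```python
def swiglu_hidden_dim(d_model, expansion=4, multiple_of=256):
    """Apply the 2/3 rule, then round up to a multiple of `multiple_of`."""
    d_ff = expansion * d_model
    hidden = int(2 * d_ff / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

def standard_ffn_params(d_model, d_ff):
    return 2 * d_model * d_ff        # W1 and W2

def glu_ffn_params(d_model, hidden):
    return 3 * d_model * hidden      # W_gate, W_up, W_down

d_model = 4096
hidden = swiglu_hidden_dim(d_model)
print(hidden)                                      # 11008
print(standard_ffn_params(d_model, 4 * d_model))   # 134217728
print(glu_ffn_params(d_model, hidden))             # 135266304
```

The small surplus in the gated layer comes entirely from rounding the hidden dimension up.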

Hand Calculation: SwiGLU Step by Step

Let's trace a 4D input through both a standard FFN and a SwiGLU FFN. d_model = 4.

Input: x = [0.5, -1.0, 0.3, 0.8]

Standard FFN (d_ff = 8, so W₁ is 4×8):

Suppose after W₁, the 8D hidden vector is: h = [1.2, -0.8, 0.5, -1.5, 2.1, 0.3, -0.1, 0.9]

After ReLU: [1.2, 0, 0.5, 0, 2.1, 0.3, 0, 0.9]. Three features killed outright — the ReLU decided independently for each.

SwiGLU FFN (d' = 5 to match params — since 3×4×5 = 60 ≈ 2×4×8 = 64):

Gate path (W_gate · x): [0.8, -1.2, 0.4, -0.3, 1.5]

Content path (W_up · x): [1.1, 0.7, -0.9, 0.5, 0.2]

Apply SiLU to gate path:

SiLU(0.8) = 0.8 × σ(0.8) = 0.8 × 0.690 = 0.552
SiLU(-1.2) = -1.2 × σ(-1.2) = -1.2 × 0.232 = -0.278
SiLU(0.4) = 0.4 × σ(0.4) = 0.4 × 0.599 = 0.240
SiLU(-0.3) = -0.3 × σ(-0.3) = -0.3 × 0.426 = -0.128
SiLU(1.5) = 1.5 × σ(1.5) = 1.5 × 0.818 = 1.227

Element-wise multiply (gate ⊙ content):

0.552 × 1.1 = 0.607 (gate passes about half of the content)
-0.278 × 0.7 = -0.195 (gate partially inverts and suppresses)
0.240 × -0.9 = -0.216 (gate passes, content is negative)
-0.128 × 0.5 = -0.064 (gate nearly closes this channel)
1.227 × 0.2 = 0.245 (gate wide open, but content is small)

Hidden after gating: [0.607, -0.195, -0.216, -0.064, 0.245]. Then W_down projects back to d_model = 4.

Notice: no feature was zeroed outright. The gate scaled each content value by a learned, input-dependent factor rather than making a binary pass/kill decision. Feature 4 has a wide-open gate (1.227) but small content (0.2), so its output stays small. Feature 1 has a negative gate that flips the sign of its content and shrinks it. A plain ReLU cannot express either behavior.
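The SwiGLU hand calculation above can be replayed in plain Python. This sketch starts from the two projected vectors given in the text (the W_gate and W_up matrices themselves are not specified, so they are not reconstructed here). Values computed at full precision can differ from the hand-rounded ones in the last digit.

```python
import math

def silu(z):
    """SiLU(z) = z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

gate_pre = [0.8, -1.2, 0.4, -0.3, 1.5]   # W_gate · x, from the worked example
content = [1.1, 0.7, -0.9, 0.5, 0.2]     # W_up · x, from the worked example

gate = [silu(g) for g in gate_pre]
hidden = [g * c for g, c in zip(gate, content)]

print("gate:  ", [round(g, 3) for g in gate])
print("hidden:", [round(h, 3) for h in hidden])
```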

SHOWCASE: FFN Architecture Arena

[Interactive figure: FFN Architecture Comparison. Shows the data flow through each FFN architecture for a sample input, and how parameter counts change with d_model.]

Code: SwiGLU FFN from Scratch

python
import torch
import torch.nn as nn

class StandardFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        # 2/3 rule: hidden_dim = 2/3 * d_ff
        hidden = int(2 * d_ff / 3)
        hidden = hidden + (256 - hidden % 256) % 256  # round up to 256
        self.w_gate = nn.Linear(d_model, hidden, bias=False)
        self.w_up   = nn.Linear(d_model, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x):
        gate = nn.functional.silu(self.w_gate(x))  # SiLU = Swish
        up   = self.w_up(x)
        return self.w_down(gate * up)             # element-wise gating

# Parameter count comparison
d, ff = 4096, 4 * 4096
std = StandardFFN(d, ff)
glu = SwiGLUFFN(d, ff)
std_p = sum(p.numel() for p in std.parameters())
glu_p = sum(p.numel() for p in glu.parameters())
print(f"Standard FFN: {std_p:,} params")  # 134,217,728
print(f"SwiGLU FFN:   {glu_p:,} params")  # 135,266,304 (within 1%)

Why SwiGLU beat everything else. Shazeer (2020) compared several FFN variants on language modeling perplexity, including the gated GLU, ReGLU, GeGLU, and SwiGLU layers alongside ungated ReLU, GELU, and Swish baselines. The gated layers came out ahead, with GeGLU and SwiGLU the strongest. The reason isn't just the activation choice: it's the gating structure. Having two separate projections lets one learned view of the input multiplicatively scale another, so the network learns which features to amplify and which to suppress. A standard FFN can only suppress a feature based on that feature's own pre-activation value. That's the real power of GLU: learned feature selection at every layer.

Chapter 7: Mish & The Unifying Pattern

Notice a pattern? GELU multiplies the input by its normal CDF. Swish multiplies the input by its sigmoid. Every modern activation function is the same template: x times some smooth gate function. Mish continues this pattern, and understanding the template is more valuable than memorizing individual formulas.

The x · gate(x) Framework

Every modern activation function can be written as:

f(x) = x · g(x)

Where g: ℝ → [0, 1] is a smooth function that approaches 1 for large positive x and 0 for large negative x. The x· prefix ensures the function behaves like the identity for large positive inputs. The gate ensures suppression for large negative inputs. Different gate functions give different activations:

Activation	Gate g(x)	Formula
SiLU/Swish	σ(x) = 1/(1+e^-x)	x · σ(x)
GELU	Φ(x) = 0.5(1+erf(x/√2))	x · Φ(x)
Mish	tanh(softplus(x))	x · tanh(ln(1+e^x))

All three gates have the same shape: a smooth S-curve from 0 to 1. They differ in exactly where they transition and how quickly, but the overall behavior is nearly identical.

Once you see the template, you understand all modern activations. Stop memorizing individual formulas. Instead, remember: f(x) = x · (smooth gate from 0 to 1). The only question is which gate — and in practice, the choice barely matters.

Mish — The Third Member

Proposed by Diganta Misra in 2019, Mish uses a gate built from two familiar pieces: softplus and tanh.

Mish(x) = x · tanh(softplus(x)) = x · tanh(ln(1 + e^x))

Let's unpack this from the inside out:

softplus(x) = ln(1 + e^x) — a smooth approximation of ReLU. For large positive x, softplus(x) ≈ x. For large negative x, softplus(x) ≈ 0. At x=0, softplus(0) = ln(2) ≈ 0.693.

tanh squashes its input to the range [-1, 1]. Since softplus is always non-negative, tanh(softplus(x)) is always in [0, 1] — a valid gate. For large positive x: softplus ≈ x, tanh(x) ≈ 1, so the gate ≈ 1. For large negative x: softplus ≈ 0, tanh(0) = 0, so the gate ≈ 0. Exactly the behavior we need.
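The limiting behavior of the Mish gate is easy to check numerically. The sketch below uses a rewritten form of softplus, max(x, 0) + log1p(e^(-|x|)), which is algebraically equal to ln(1 + e^x) but does not overflow for large x. That rewrite is a common numerical trick and is an addition here, not something the Mish definition requires.

```python
import math

def softplus(x):
    """ln(1 + e^x), computed in an overflow-safe form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish_gate(x):
    """The Mish gate: tanh(softplus(x)), always strictly between 0 and 1."""
    return math.tanh(softplus(x))

for x in (-20.0, -2.0, 0.0, 2.0, 20.0):
    print(f"x = {x:6.1f}   softplus = {softplus(x):.6f}   gate = {mish_gate(x):.6f}")
```

At x = 0 the gate is tanh(ln 2) = 0.6, so the Mish gate sits slightly above the 0.5 that the sigmoid and normal-CDF gates give at the origin.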

Hand Calculation: Mish Step by Step

x = -2:

softplus(-2) = ln(1 + e^-2) = ln(1 + 0.135) = ln(1.135) = 0.127
tanh(0.127) = 0.126
Mish(-2) = -2 × 0.126 = -0.253

x = -1:

softplus(-1) = ln(1 + e^-1) = ln(1.368) = 0.313
tanh(0.313) = 0.303
Mish(-1) = -1 × 0.303 = -0.303

x = 0:

softplus(0) = ln(1 + 1) = ln(2) = 0.693
tanh(0.693) = 0.600
Mish(0) = 0 × 0.600 = 0

x = 1:

softplus(1) = ln(1 + e) = ln(3.718) = 1.313
tanh(1.313) = 0.865
Mish(1) = 1 × 0.865 = 0.865

x = 2:

softplus(2) = ln(1 + e²) = ln(8.389) = 2.127
tanh(2.127) = 0.972
Mish(2) = 2 × 0.972 = 1.944

Compare the three activations at the same inputs:

x	GELU	SiLU	Mish	ReLU
-2	-0.046	-0.238	-0.253	0
-1	-0.159	-0.269	-0.303	0
0	0	0	0	0
1	0.841	0.731	0.865	1
2	1.954	1.762	1.944	2

The differences are small, at most about 0.2 in absolute terms over this range. For large positive inputs, all three converge toward ReLU. Mish dips slightly below SiLU for negative values and tracks GELU closely for positive ones.
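The comparison table can be regenerated at full precision with the standard library alone. This is a small sketch using the closed-form definitions from this chapter; the last printed digit may differ slightly from hand-rounded values.

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    return x / (1.0 + math.exp(-x))

def mish(x):
    return x * math.tanh(math.log1p(math.exp(x)))

def relu(x):
    return max(0.0, x)

print(f"{'x':>4} {'GELU':>8} {'SiLU':>8} {'Mish':>8} {'ReLU':>6}")
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"{x:4.0f} {gelu(x):8.3f} {silu(x):8.3f} {mish(x):8.3f} {relu(x):6.1f}")
```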

Where Mish Shows Up

Mish gained adoption primarily in computer vision:

YOLOv4 (2020) — replaced Leaky ReLU with Mish in the backbone, gaining ~1% mAP.
YOLOv5 — continued using Mish in certain configurations.
CSPDarknet — the backbone architecture paired with Mish.

In NLP/LLMs, Mish never gained traction — GELU and SiLU were already established. But Mish proved that the x·gate(x) template is robust: you can swap in different gates and get similar performance.

Interactive: All Modern Activations

[Interactive figure: Activation Comparison Arena. Top panel: function curves. Bottom panel: difference from ReLU, where the deviations are small and concentrated near zero.]

Code: The Unified Template

python
import torch
import torch.nn.functional as F
import math

# The x * gate(x) template
def gated_activation(x, gate_fn):
    """All modern activations: f(x) = x * gate(x)"""
    return x * gate_fn(x)

# Different gates
def sigmoid_gate(x):    return torch.sigmoid(x)       # → SiLU
def phi_gate(x):        return 0.5 * (1 + torch.erf(x / math.sqrt(2)))  # → GELU
def mish_gate(x):       return torch.tanh(F.softplus(x))  # → Mish

# All three from the same template
x = torch.linspace(-3, 3, 100)
silu_out = gated_activation(x, sigmoid_gate)
gelu_out = gated_activation(x, phi_gate)
mish_out = gated_activation(x, mish_gate)

# How different are they really?
print("Max |GELU - SiLU|:", (gelu_out - silu_out).abs().max().item())  # ~0.19
print("Max |GELU - Mish|:", (gelu_out - mish_out).abs().max().item())  # ~0.21
print("Max |SiLU - Mish|:", (silu_out - mish_out).abs().max().item())  # ~0.18
# Gaps of about 0.2 at most on [-3, 3]; all three curves share one shape

The differences between GELU, SiLU, and Mish are small. On most benchmarks they land within a fraction of a percent of each other. Don't chase activation function benchmarks: the choice of GELU vs SiLU vs Mish matters far less than learning rate, batch size, or model architecture. Pick what your framework or model family uses and move on.

What unifying pattern do GELU, SiLU/Swish, and Mish all share?

They all use ReLU as an internal component and just smooth it differently
They all require a learnable β parameter to control gate sharpness
They all follow f(x) = x · g(x), where g(x) is a smooth gate approaching 1 for large positive x and 0 for large negative x
They all clip outputs to the range [-1, 1] to prevent gradient explosion

Chapter 8: The Arena

We've studied each activation function in isolation. Now let's put them all on the same axes and watch them compete. The simulation below plots every function we've covered — with its gradient — so you can see at a glance how they differ in the regions that matter most: near zero, deep negative, and far positive.

The key tradeoffs become visible immediately. Sigmoid and tanh saturate on both sides. ReLU is dead on the left but perfectly linear on the right. GELU and SiLU gently curve through zero, allowing a small negative region. Mish is the smoothest of all. And SwiGLU isn't shown directly because it's a gated mechanism applied to two streams — you saw it in Chapter 6.

What to Look For

Peak gradient: Sigmoid peaks at 0.25 (at x = 0). Tanh peaks at 1.0 at x = 0, ReLU is exactly 1 for x > 0, and the modern variants start near 0.5 to 0.6 at x = 0 and reach about 1.0 for positive x. This single fact explains why sigmoid kills deep networks.
Negative region: ReLU is flat zero. Leaky ReLU is a faint line. GELU and SiLU dip slightly negative before returning to zero — they allow a small negative signal. This matters more than it sounds: a neuron that briefly outputs a small negative can still contribute useful information.
Smoothness at x = 0: ReLU has a kink (non-differentiable at exactly zero). GELU, SiLU, and Mish are perfectly smooth. Smooth activations produce smoother loss landscapes, which optimizers navigate more efficiently.
Far-positive behavior: All modern activations converge to the identity (slope 1) for large positive x. The gradient is 1, meaning no saturation on the right side. The war is fought in the negative region and around zero.
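The slopes in the list above can be estimated with central differences. The sketch below covers only the smooth activations; ReLU and Leaky ReLU are left out because a central difference taken exactly at their kink returns an average of the two one-sided slopes rather than a true derivative. The helper name `slope` and the probe points (0 and 6) are arbitrary choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    return x * sigmoid(x)

def mish(x):
    return x * math.tanh(math.log1p(math.exp(x)))

def slope(f, x, h=1e-5):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

SMOOTH = {"sigmoid": sigmoid, "tanh": math.tanh, "gelu": gelu, "silu": silu, "mish": mish}

for name, f in SMOOTH.items():
    print(f"{name:8s} slope at 0: {slope(f, 0.0):.3f}   slope at 6: {slope(f, 6.0):.3f}")
```

Sigmoid and tanh collapse toward zero slope on the right, while GELU, SiLU, and Mish settle near 1.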

[Interactive figure: Activation Function Arena. All functions on the same axes. Top panel: function values. Bottom panel: gradients (derivatives).]

The Gradient Story

Look at the bottom panel (gradients) with all functions enabled. The picture tells the entire story of activation function evolution:

Sigmoid (1986): gradient peaks at 0.25 and vanishes on both sides. A ceiling that limits depth.
Tanh (1991): gradient peaks at 1.0 but still vanishes on both sides. Better, but deep networks still struggle.
ReLU (2012): gradient is exactly 1 for x > 0, exactly 0 for x < 0. Binary: alive or dead. Helped make much deeper networks trainable, but can kill neurons.
Leaky ReLU (2013): gradient is 1 for x > 0, 0.01 for x < 0. No dead neurons, but the negative gradient is tiny.
GELU (2016): gradient smoothly transitions through zero, reaching ~1 for large positive x. Used in BERT, GPT-2, ViT.
SiLU/Swish (2017): nearly identical to GELU in practice. The gradient rises slightly above 1.0, peaking at about 1.1 near x ≈ 2.4. Used in EfficientNet.
Mish (2019): the smoothest gradient curve. Barely distinguishable from SiLU in practice, but its second derivative is smoother.

The trend is clear: each generation produced smoother, more gradient-friendly activations. The field converged on functions that are smooth at zero, have gradient near 1 for positive inputs, and allow a small controlled negative signal.

In modern LLMs with published architectures (LLaMA, Mistral, PaLM), SwiGLU has become the default. It combines the smooth gradient of SiLU with the adaptive gating of GLU. The Arena shows individual activations, but the frontier has moved to gated activations where two streams interact. SwiGLU won by combining good gradient flow with learned input-dependent gating.

Looking at the gradient panel: which activation has the highest maximum gradient for any input value?

Sigmoid: because it's the oldest and most studied
ReLU: because its gradient is exactly 1
SiLU/Swish: its gradient exceeds 1.0 for x above roughly 1.3 and peaks at about 1.1 near x ≈ 2.4, higher than any other option listed here
Tanh: because tanh'(0) = 1.0

Chapter 9: Cheat Sheet & Connections

Ten chapters, eight activation functions, one gating mechanism. Here's everything compressed into a single reference table, followed by the decision tree for choosing the right activation and links to where these ideas lead next.

The Complete Reference

Name	Formula	Gradient	Key Property	Used In
Sigmoid	1/(1+e^-x)	σ(1-σ), max 0.25	Saturates both sides	Output gates, LSTM gates
Tanh	(e^x-e^-x)/(e^x+e^-x)	1 - tanh², max 1.0	Zero-centered, saturates	LSTM state, older RNNs
ReLU	max(0, x)	1 if x>0, else 0	Dead neurons, but simple	CNNs, default choice pre-2018
Leaky ReLU	x if x>0, αx if x≤0	1 if x>0, α if x≤0	No dead neurons	GANs, when ReLU dies
ELU	x if x>0, α(e^x-1) if x≤0	1 if x>0, αe^x if x≤0	Smooth, pushes mean toward 0	Niche use, research
GELU	x · Φ(x)	Φ(x) + x · φ(x)	Smooth, probabilistic gate	BERT, GPT-2, ViT
SiLU/Swish	x · σ(x)	σ(x)(1 + x(1-σ(x)))	Non-monotonic, gradient > 1	EfficientNet, Stable Diffusion
Mish	x · tanh(softplus(x))	Complex (see Ch 7)	Smoothest of all	YOLOv4, niche use
SwiGLU	SiLU(xW) ⊙ (xV)	Gated: gradient depends on both streams	Learned gating, param-matched via the 2/3 rule	LLaMA, PaLM, Mistral
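The closed-form gradients in the table for SiLU and GELU can be sanity-checked against finite differences, and the same grid can be scanned for where each gradient peaks. This is a rough sketch; the grid range [-5, 5], the step of 0.01, and the helper names are arbitrary choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def silu(x):
    return x * sigmoid(x)

def gelu(x):
    return x * normal_cdf(x)

def silu_grad(x):
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))            # σ(x)(1 + x(1 - σ(x)))

def gelu_grad(x):
    return normal_cdf(x) + x * normal_pdf(x)    # Φ(x) + x · φ(x)

def numeric_grad(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2.0 * h)

xs = [i / 100.0 for i in range(-500, 501)]
report = {}
for name, f, g in (("SiLU", silu, silu_grad), ("GELU", gelu, gelu_grad)):
    max_err = max(abs(g(x) - numeric_grad(f, x)) for x in xs)
    peak_x = max(xs, key=g)
    report[name] = (max_err, peak_x, g(peak_x))
    print(f"{name}: max formula error {max_err:.1e}, peak gradient {g(peak_x):.4f} at x = {peak_x:.2f}")
```

On this grid the SiLU gradient peaks near 1.10 around x ≈ 2.4, and the GELU gradient peaks near 1.13 around x ≈ 1.41, so both rise slightly above 1.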

The Decision Tree

What are you building?

Architecture determines the activation function, not the other way around.

↓

LLM / Transformer?

SwiGLU (LLaMA/Mistral-style) or GELU (BERT/GPT-2-style). SwiGLU is the modern default.

↓

CNN / Image model?

ReLU for simplicity, SiLU/Swish when accuracy matters most (as in EfficientNet). GELU in Vision Transformers.

↓

GAN?

Leaky ReLU in the discriminator (prevents dead neurons with adversarial training). ReLU or Leaky in the generator.

↓

RNN / LSTM / GRU?

Tanh for cell state, sigmoid for gates. These are baked into the architecture — don't change them.

↓

Output layer?

Sigmoid for binary/multi-label classification. Softmax for multi-class. None (linear) for regression.

Where to Go Next

Loss Functions — activations shape the forward pass; loss functions shape what the network learns. Loss Functions covers MSE, cross-entropy, contrastive losses, and when to use each.

Normalization: BatchNorm, LayerNorm, and RMSNorm work hand-in-hand with activations. They re-center or rescale inputs before the activation, preventing the saturation that killed sigmoid networks. Normalization derives each technique from scratch.

Optimizers — Adam, AdamW, and learning rate schedules determine how the gradients (shaped by activations) become weight updates. Optimizers covers the full landscape.

Transformer — the architecture where SwiGLU lives. The feed-forward network in every transformer block uses an activation function. The Transformer lesson shows the complete architecture.

Backpropagation — we talked about gradient chains through activations. Backpropagation shows the full chain rule through every layer type, not just activations.

The meta-lesson: activation functions evolved from biological analogy (sigmoid) to mathematical pragmatism (ReLU) to empirical optimization (GELU, SiLU) to learned gating (SwiGLU). Each step made gradients flow better, enabling deeper networks. The next frontier isn't a new activation — it's architectures that make the activation choice matter less (residual connections, normalization, attention).

You're building a modern LLM from scratch. Which activation should you use in the feed-forward layers?

Sigmoid: it's the most well-understood activation
ReLU: it's the simplest and fastest to compute
SwiGLU: it combines smooth SiLU activation with learned gating and is used in LLaMA, PaLM, Mistral, and other state-of-the-art LLMs
Tanh: it's zero-centered, which helps training