Training Foundations

Activation Functions

The nonlinearities that make deep learning deep — from the sigmoid that started it all to the SwiGLU inside every modern LLM.

Prerequisites: What a neural network layer does + Derivatives (what a gradient is). That's it.
10 Chapters · 12+ Simulations · 0 Assumed Knowledge

Chapter 0: The Dead Network Problem

Stack ten linear layers. The result? Mathematically identical to a single linear layer. No matter how deep you go, you can only model straight lines, flat planes, linear boundaries. A 100-layer linear network has the same representational power as a 1-layer one. That's why activation functions exist — they inject the nonlinearity that lets deep networks model curves, corners, and complexity.

This isn't obvious until you see it fail. Take a classification problem where the classes aren't separable by a straight line — say, two interlocking spirals or concentric circles. A linear model, no matter how many layers, draws a single straight boundary. It fails catastrophically. Add one nonlinear activation function between each layer, and suddenly the network can bend, twist, and carve out curved decision regions.

But "just add nonlinearity" hides a trap. The wrong nonlinearity kills your network in a different way — by crushing gradients to zero so weights stop updating. This chapter shows both failure modes: the futility of depth without nonlinearity, and the danger of choosing poorly.

Why Linear Layers Collapse

A single linear layer computes y = Wx + b. Stack two:

y1 = W1 · x + b1
y2 = W2 · y1 + b2

Substitute y1 into the second equation:

y2 = W2 · (W1 · x + b1) + b2 = (W2 · W1) · x + (W2 · b1 + b2)

That's just W_eff · x + b_eff, a single linear transformation with W_eff = W2 · W1 and b_eff = W2 · b1 + b2. Two layers collapsed into one. Three layers? Same thing. A hundred layers? Still one effective matrix times one effective bias.

Hand Calculation: The Collapse in Action

Setup. Two 2×2 weight matrices, no bias, no activation. We'll multiply them together and show that two layers = one layer.

Layer 1 weights: W1 = [[1, 2], [3, 4]]

Layer 2 weights: W2 = [[0.5, -1], [1, 0.5]]

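Compute W_eff = W2 · W1, one row at a time (row of W2 times columns of W1):

Row 1: [0.5·1 + (-1)·3, 0.5·2 + (-1)·4] = [-2.5, -3.0]
Row 2: [1·1 + 0.5·3, 1·2 + 0.5·4] = [2.5, 4.0]

So W_eff = [[-2.5, -3.0], [2.5, 4.0]]. Feed in x = [1, 0]:

Two layers: W1 · x = [1, 3], then W2 · [1, 3] = [0.5 - 3, 1 + 1.5] = [-2.5, 2.5]
One layer: W_eff · x = [-2.5, 2.5]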

Identical. Two layers, one layer — same output. Depth bought us nothing. Now imagine stacking 10, 50, or 100 of these. It's still one matrix. The network has no more expressive power than a single layer, regardless of depth.
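
The hand calculation can also be checked numerically. Here is a minimal NumPy sketch using the same W1, W2, and x as above. The names `two_layer_out` and `one_layer_out` are illustrative and do not come from the walkthrough.

```python
import numpy as np

# Weights and input from the hand calculation (no bias, no activation).
W1 = np.array([[1.0, 2.0], [3.0, 4.0]])
W2 = np.array([[0.5, -1.0], [1.0, 0.5]])
x = np.array([1.0, 0.0])

# Path A: push the input through the two layers one after the other.
two_layer_out = W2 @ (W1 @ x)

# Path B: collapse the layers first, then apply the single effective matrix.
W_eff = W2 @ W1
one_layer_out = W_eff @ x

print("W_eff rows:", W_eff.tolist())          # [[-2.5, -3.0], [2.5, 4.0]]
print("two layers:", two_layer_out.tolist())  # [-2.5, 2.5]
print("one layer: ", one_layer_out.tolist())  # [-2.5, 2.5]
```

Both paths give the same vector, which is the collapse stated in the text.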

What Nonlinearity Buys You

Insert a nonlinear function σ between layers:

y2 = W2 · σ(W1 · x + b1) + b2

Now you can't factor out a single W_eff. The σ breaks the linearity. Each additional layer adds a new "fold" or "bend" to the function the network can represent. With enough hidden units, a network with a suitable nonlinearity can approximate any continuous function on a bounded input region to arbitrary accuracy. That result is the universal approximation theorem.
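
One way to see the difference is to test additivity. A purely linear stack must satisfy f(a) + f(b) = f(a + b), while a stack with a nonlinearity in the middle generally does not. The sketch below uses small hand-picked weights (hypothetical, chosen only so the arithmetic is exact) and ReLU as the σ.

```python
import numpy as np

# Hypothetical weights for a 2 -> 4 -> 2 network, no biases.
W1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
W2 = np.array([[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]])

def linear_stack(x):
    return W2 @ (W1 @ x)

def relu_stack(x):
    return W2 @ np.maximum(0.0, W1 @ x)

def additivity_gap(f, a, b):
    # Zero for any linear map; usually nonzero once a nonlinearity is inserted.
    return np.abs(f(a) + f(b) - f(a + b)).max()

a = np.array([1.0, -2.0])
b = -a  # opposite input, so a + b is the zero vector

linear_gap = additivity_gap(linear_stack, a, b)
relu_gap = additivity_gap(relu_stack, a, b)
print("linear gap:", linear_gap)  # 0.0
print("ReLU gap:  ", relu_gap)    # 7.0
```

The ReLU network treats a and -a differently, so it can no longer be rewritten as one matrix.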

The simulation below shows this dramatically. On the left, a linear network tries to classify points in two concentric circles. It can only draw a straight line. On the right, add ReLU between layers, and the decision boundary curves to fit the data.

Linear vs. Nonlinear Decision Boundaries

Left: linear network (no activations). Right: same architecture with ReLU. Toggle nonlinearity and adjust depth.


The difference is stark. With linearity, adding layers from 1 to 8 changes nothing — the boundary stays straight. With ReLU, more layers mean more expressive power: the boundary wraps tighter around the data.

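python
import torch
import torch.nn as nn

# 10-layer linear network (each nn.Linear has a weight and a bias)
linear_net = nn.Sequential(*[nn.Linear(2, 2) for _ in range(10)])

# Collapse all layers into one effective matrix and one effective bias
W_eff = torch.eye(2)
b_eff = torch.zeros(2)
with torch.no_grad():
    for layer in linear_net:
        W_eff = layer.weight @ W_eff
        b_eff = layer.weight @ b_eff + layer.bias

    # The collapsed layer reproduces the 10-layer output
    x = torch.randn(2)
    print(torch.allclose(linear_net(x), W_eff @ x + b_eff, atol=1e-5))  # True

# W_eff is a SINGLE 2×2 matrix: 10 layers = 1 layer
print("10 layers collapsed to:", W_eff.shape)  # torch.Size([2, 2])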
You might think "any nonlinearity will do." It won't. Sigmoid and tanh saturate for large inputs, crushing gradients to near-zero. ReLU can leave neurons stuck at zero output, where they receive no gradient and stop learning. The choice of activation function directly determines whether your gradients flow or die. The next chapters walk through each one: its strengths, its failure mode, and when to use it.
What happens if you stack 100 layers with no activation function between them?

Chapter 1: Sigmoid & Tanh — The Classics

The first activation functions were biologically inspired. A neuron either fires or doesn't, and sigmoid is the smooth version of that step, rising from 0 to 1. Its zero-centered cousin, tanh, maps to (-1, 1). Both were the backbone of neural networks from the 1980s to ~2012. Both have a fatal flaw.

To understand that flaw, we need to derive these functions from scratch, compute their gradients by hand, and watch what happens when you chain those gradients through a deep network.

Deriving Sigmoid

We want a smooth function that maps any real number to the range (0, 1). It should be near 0 for very negative inputs, near 1 for very positive inputs, and transition smoothly around zero. The logistic function does exactly this:

σ(x) = 1 / (1 + e^(-x))

Why this particular form? The exponential e^(-x) is huge when x is very negative (pushing the denominator up, making σ near 0) and tiny when x is very positive (denominator ≈ 1, making σ near 1). The transition happens around x = 0, where e^0 = 1, so σ(0) = 1/(1+1) = 0.5.

The derivative has an elegant form. Starting from σ(x) = (1 + e^(-x))^(-1), the chain rule gives:

σ'(x) = e^(-x) / (1 + e^(-x))^2 = [1 / (1 + e^(-x))] · [e^(-x) / (1 + e^(-x))]

The first factor is σ(x) and the second is 1 - σ(x), so:

σ'(x) = σ(x) · (1 - σ(x))

This is beautiful but dangerous. The derivative is the product of σ(x) and its complement. The maximum occurs when both factors are equal — at x = 0, where σ(0) = 0.5, giving σ'(0) = 0.5 × 0.5 = 0.25.

Read that again: the maximum possible gradient of sigmoid is 0.25. Not 1. Not 0.5. A quarter. And it only gets worse from there.

Hand Calculation: Sigmoid Values and Gradients

Trace sigmoid across five input values. We compute both the output and the gradient at each point to see exactly where gradients die.
| x | e^(-x) | σ(x) | σ'(x) = σ(1-σ) | Status |
| --- | --- | --- | --- | --- |
| -3 | 20.09 | 1/(1+20.09) = 0.047 | 0.047 × 0.953 = 0.045 | Saturated low |
| -1 | 2.718 | 1/(1+2.718) = 0.269 | 0.269 × 0.731 = 0.197 | Weak gradient |
| 0 | 1.000 | 1/(1+1) = 0.500 | 0.500 × 0.500 = 0.250 | Maximum gradient |
| 1 | 0.368 | 1/(1+0.368) = 0.731 | 0.731 × 0.269 = 0.197 | Weak gradient |
| 3 | 0.050 | 1/(1+0.050) = 0.953 | 0.953 × 0.047 = 0.045 | Saturated high |

At x = ±3, the gradient is already 5.5× smaller than the maximum. At x = 5: σ(5) ≈ 0.9933, σ'(5) ≈ 0.9933 × 0.0067 ≈ 0.0066. That's about 38× smaller than the maximum.
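
The closed-form gradient can be cross-checked against a numerical derivative. This is a minimal sketch using only the standard library. The helper `numeric_grad` is a hypothetical name for a plain central finite difference and is not part of the original walkthrough.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Closed form: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_grad(f, x, h=1e-5):
    # Central finite difference, used only as an independent check.
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in [-3, -1, 0, 1, 3, 5]:
    analytic = sigmoid_grad(x)
    numeric = numeric_grad(sigmoid, x)
    print(f"x={x:+d}  sigma={sigmoid(x):.4f}  analytic={analytic:.4f}  numeric={numeric:.4f}")
```

The analytic and numeric columns agree to the printed precision, and the gradient peaks at 0.25 for x = 0.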

Deriving Tanh

Sigmoid outputs are always positive (between 0 and 1). This means the next layer always receives positive inputs, which can cause zig-zagging during gradient descent. Tanh fixes this by centering the output around zero:

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

The relationship between tanh and sigmoid is direct: tanh(x) = 2σ(2x) - 1. Tanh maps to (-1, 1) instead of (0, 1). Same S-shape, but centered at zero.

The derivative:

tanh'(x) = 1 - tanh²(x)

At x = 0: tanh(0) = 0, so tanh'(0) = 1 - 0 = 1.0. That's 4× better than sigmoid's 0.25 at the peak. But tanh still saturates: at x = 3, tanh(3) ≈ 0.995, giving tanh'(3) = 1 - 0.995² ≈ 0.010. At x = 5, tanh'(5) ≈ 0.00018. The gradient is essentially dead.
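
Both the identity tanh(x) = 2σ(2x) - 1 and the gradient values are easy to confirm numerically. The following is a small standard-library sketch. The function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    # tanh written as a rescaled, recentered sigmoid
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

for x in [0.0, 1.0, 3.0, 5.0]:
    print(f"x={x:.0f}  tanh={math.tanh(x):.6f}  "
          f"via sigmoid={tanh_via_sigmoid(x):.6f}  tanh'={tanh_grad(x):.6f}")
```

The two tanh columns match. The gradient column falls from 1.0 at x = 0 to roughly 0.0002 at x = 5.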

The Vanishing Gradient Chain

During backpropagation, the gradient at each layer is multiplied by the local activation gradient. In a 10-layer network with sigmoid activations, the gradient arriving at layer 1 is the product of 10 sigmoid derivatives.

Even at the maximum (0.25 per layer):

0.25^10 = 0.25 × 0.25 × ... × 0.25 ≈ 9.5 × 10^-7

That's less than one millionth of the original gradient. Layer 1's weights essentially stop learning. This is the vanishing gradient problem, and it's why networks deeper than ~5 layers were nearly impossible to train before 2012.

| Layers | Sigmoid (0.25^n) | Tanh best (1.0^n) | Tanh typical (0.6^n) |
| --- | --- | --- | --- |
| 3 | 0.0156 | 1.0 | 0.216 |
| 5 | 0.00098 | 1.0 | 0.0778 |
| 10 | 9.5 × 10^-7 | 1.0 | 0.0060 |
| 20 | 9.1 × 10^-13 | 1.0 | 3.7 × 10^-5 |

The "tanh best" column is the theoretical maximum — all inputs at exactly zero. In practice, inputs wander away from zero, and tanh gradients shrink to ~0.6 or less. The "typical" column shows the realistic picture. Tanh is better than sigmoid, but still vanishes.
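
The table is repeated multiplication, so it can be regenerated in a few lines. This sketch assumes the same per-layer factors as the table (0.25, 1.0, and the illustrative "typical" value 0.6). The dictionary keys are hypothetical labels.

```python
layer_counts = [3, 5, 10, 20]
per_layer_gradient = {"sigmoid best": 0.25, "tanh best": 1.0, "tanh typical": 0.6}

def chain(per_layer, n_layers):
    # Product of n identical local gradients.
    return per_layer ** n_layers

table = {
    name: [chain(g, n) for n in layer_counts]
    for name, g in per_layer_gradient.items()
}

for i, n in enumerate(layer_counts):
    row = "  ".join(f"{name}={values[i]:.2e}" for name, values in table.items())
    print(f"layers={n:2d}  {row}")
```

Changing 0.6 to another value shows how sensitive the chain is to the typical local gradient.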

Sigmoid & Tanh: Curves and Gradient Chains

Toggle between sigmoid and tanh. Drag the input slider to see the gradient at each point. Below: watch the gradient chain shrink as layers increase.

python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)  # max = 0.25 at x=0

def tanh_grad(x):
    t = np.tanh(x)
    return 1 - t**2    # max = 1.0 at x=0

# Chain of 10 sigmoid gradients at x=0 (best case)
chain_10 = sigmoid_grad(0) ** 10
print(f"10-layer sigmoid chain: {chain_10:.2e}")  # 9.54e-07

# Chain of 10 tanh gradients at x=1 (a moderately active unit)
chain_10_tanh = tanh_grad(1) ** 10
print(f"10-layer tanh chain (x=1): {chain_10_tanh:.2e}")  # 1.71e-04
Sigmoid isn't "bad." It's still the right choice for binary classification output layers (where you WANT a probability in [0,1]) and for gates in LSTMs and GRUs (where the gate needs to smoothly interpolate between "fully open" and "fully closed"). The problem is using it as a hidden-layer activation in deep feedforward networks. Context matters — an activation function isn't universally good or bad; it's about where you put it.
Why does the sigmoid gradient vanish for large inputs?

Chapter 2: ReLU — The Revolution

In 2012, a simple insight changed everything. What if the activation function was just... a ramp? No exponentials, no saturation, no squashing. For positive inputs, pass them through unchanged. For negative inputs, output zero. That's ReLU — the Rectified Linear Unit — and it made deep learning possible.

ReLU(x) = max(0, x)

That's the entire definition. One line. No parameters, no exponentials, no division. Just a comparison and a zero. It's the simplest nonlinear function you can imagine, and it solved the vanishing gradient problem that had plagued neural networks for decades.

Why ReLU Works

The gradient of ReLU is:

ReLU'(x) = 1   if x > 0,    0   if x ≤ 0

For positive inputs, the gradient is exactly 1. Not 0.25 like sigmoid. Not 0.6 like typical tanh. Exactly 1. This means gradients flow through active ReLU neurons without any shrinkage.

In a 100-layer network where all neurons are active (positive inputs), the gradient chain is:

1 × 1 × 1 × ... × 1 = 1^100 = 1

Compare with sigmoid:

0.25 × 0.25 × ... × 0.25 = 0.25^100 ≈ 6 × 10^-61

That's not a typo. 6 × 10^-61. ReLU gives you a gradient of 1; sigmoid gives you a factor so small that, about 30 layers deeper, it drops below one over the number of atoms in the observable universe (10^-80 territory). This is why AlexNet (2012), the model that launched the deep learning era, used ReLU, and why most architectures after it followed suit.

Hand Calculation: ReLU vs. Sigmoid Gradients

Compare gradient flow through 10 layers. We trace the gradient for both activations at a realistic input value.
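| x | ReLU(x) | ReLU'(x) | σ(x) | σ'(x) |
| --- | --- | --- | --- | --- |
| -2 | 0 | 0 | 0.119 | 0.105 |
| -1 | 0 | 0 | 0.269 | 0.197 |
| 0 | 0 | 0 or 1 | 0.500 | 0.250 |
| 0.5 | 0.5 | 1 | 0.622 | 0.235 |
| 1 | 1 | 1 | 0.731 | 0.197 |
| 3 | 3 | 1 | 0.953 | 0.045 |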

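Now chain 10 of these gradients together. Assume all neurons are active (x > 0). ReLU gives 1^10 = 1. Sigmoid gives σ'(0.5)^10 ≈ 5.1 × 10^-7 at x = 0.5 and σ'(1)^10 ≈ 8.6 × 10^-8 at x = 1.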

Six to eight orders of magnitude difference. This is why ReLU enabled deep networks. Gradients that actually reach the early layers mean early layers actually learn.

The Dead Neuron Problem

ReLU's gradient for x ≤ 0 is exactly zero. If a neuron's pre-activation becomes negative for every input it sees (say, due to a large negative bias or an unlucky weight update), it outputs zero, its gradient is zero, its weights never update, and it stays at zero forever. The neuron is permanently dead.

This isn't rare. In practice, 10-40% of neurons in a ReLU network can die during training. The main risk factors are a learning rate that is too high, poor weight initialization, and a large negative bias.

Think of it this way: sigmoid neurons get "sleepy" (vanishing gradients slow learning), but ReLU neurons can "die" (zero gradient means zero learning, forever). Sleepy neurons can eventually wake up if the gradient signal is strong enough. Dead neurons cannot.
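
The mechanism is easy to reproduce for a single neuron. The sketch below is a toy setup under stated assumptions: random inputs in [-1, 1], a fixed weight vector, and an upstream gradient of 1 for every sample (all hypothetical values chosen for illustration). It compares a healthy bias with a large negative one.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(256, 2))  # hypothetical inputs in [-1, 1]
w = np.array([0.5, -0.3])                  # hypothetical neuron weights
upstream = np.ones(len(X))                 # pretend dL/d(output) = 1 per sample

def relu_neuron_weight_grad(bias):
    z = X @ w + bias                       # pre-activations for the batch
    active = (z > 0).astype(float)         # ReLU'(z): 1 where z > 0, else 0
    grad_w = (upstream * active) @ X       # dL/dw summed over the batch
    return active.mean(), grad_w

healthy_frac, healthy_grad = relu_neuron_weight_grad(bias=0.0)
dead_frac, dead_grad = relu_neuron_weight_grad(bias=-5.0)

print(f"bias= 0.0: active on {healthy_frac:.0%} of inputs, grad_w={healthy_grad}")
print(f"bias=-5.0: active on {dead_frac:.0%} of inputs, grad_w={dead_grad}")
```

With the bias at -5 the pre-activation is negative for every input in range, so the weight gradient is exactly zero and no update can ever move the neuron back.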

ReLU vs. Sigmoid: Network Health

Top: ReLU and sigmoid curves with gradient overlay. Bottom: a grid of 64 neurons trained step-by-step. With ReLU, watch neurons die (go dark). With sigmoid, all survive but gradients fade. Adjust learning rate to see the effect.

python
import numpy as np

def relu(x):
    return np.maximum(0, x)  # one line — the simplest activation

def relu_grad(x):
    # np.asarray lets this accept plain Python floats as well as arrays
    return (np.asarray(x) > 0).astype(float)  # 1 if positive, 0 if not

# Compare gradient chains
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = 1.0
relu_chain_10 = relu_grad(x) ** 10
sig_chain_10 = (sigmoid(x) * (1 - sigmoid(x))) ** 10

print(f"ReLU  10-layer chain: {relu_chain_10:.1f}")   # 1.0
print(f"Sigmoid 10-layer chain: {sig_chain_10:.2e}")  # 8.63e-08
ReLU isn't differentiable at x = 0. In practice, this doesn't matter. We just pick either 0 or 1 as the gradient at x = 0 (convention: 0). Neural networks are trained with stochastic gradient descent on mini-batches — the probability of any single input landing at exactly 0.000... is essentially zero. The mathematical non-differentiability at a single point has zero practical impact. Don't let a calculus technicality scare you away from the most important activation function in deep learning history.
What is the "dying ReLU" problem?

Chapter 3: Leaky ReLU & ELU — Fixing Dead Neurons

Dead neurons waste capacity. If 30% of your network's neurons are dead, you're paying for a 30% bigger network than you're actually using. Compute, memory, parameters — all wasted on neurons that output zero forever and will never learn again.

The fix is elegantly simple: instead of outputting exactly zero for negative inputs, let a small signal through. A tiny leak. That's Leaky ReLU.

Deriving Leaky ReLU

LeakyReLU(x) = x   if x > 0,    αx   if x ≤ 0

Where α is a small positive constant, typically 0.01. For positive inputs, it behaves exactly like ReLU — gradient of 1, no saturation. For negative inputs, instead of a flat zero, the output is a gently sloping line with slope α.

The gradient:

LeakyReLU'(x) = 1   if x > 0,    α   if x ≤ 0

For negative inputs, the gradient is α = 0.01. Tiny, but nonzero. Dead neurons become "drowsy" neurons — they can still receive gradient signal and eventually wake up. A neuron that got pushed into negative territory by a bad weight update can slowly recover, because 0.01 of the gradient still flows through.

Deriving ELU

Exponential Linear Unit (ELU) takes a different approach to the negative region. Instead of a straight line, it uses an exponential curve:

ELU(x) = x   if x > 0,    α(e^x - 1)   if x ≤ 0

For large negative x, e^x approaches 0, so ELU approaches -α. The function has a smooth asymptote at -α for the negative side, while being identical to ReLU for positive inputs.

The gradient:

ELU'(x) = 1   if x > 0,    α · e^x   if x ≤ 0

Key difference from Leaky ReLU: the ELU gradient for very negative inputs approaches 0 (not α). At x = -1, the gradient is α · e^(-1) ≈ 0.368α. At x = -5, it's α · e^(-5) ≈ 0.0067α. This provides soft saturation: extremely negative inputs get suppressed, acting as a noise filter, while moderately negative inputs still pass gradient.

Hand Calculation: Comparing the Variants

Compute all three activations and gradients. Leaky ReLU uses α = 0.01, ELU uses α = 1.0. We trace five input values.
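| x | ReLU | ReLU' | Leaky | Leaky' | ELU | ELU' |
| --- | --- | --- | --- | --- | --- | --- |
| -3 | 0 | 0 | -0.03 | 0.01 | -0.950 | 0.050 |
| -1 | 0 | 0 | -0.01 | 0.01 | -0.632 | 0.368 |
| 0 | 0 | 0 | 0 | 0.01 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 3 | 1 | 3 | 1 | 3 | 1 |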

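Let's verify the ELU values at x = -3 step by step (with α = 1.0):

e^(-3) ≈ 0.0498
ELU(-3) = α · (e^(-3) - 1) = 1.0 × (0.0498 - 1) ≈ -0.950
ELU'(-3) = α · e^(-3) = 1.0 × 0.0498 ≈ 0.050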

Compare that gradient of 0.050 with ReLU's 0. The ELU neuron at x = -3 is still learning — slowly, but it hasn't died. And at x = -1, ELU's gradient is 0.368 — quite healthy. Leaky ReLU's gradient at x = -1 is a constant 0.01 regardless of how negative the input gets. ELU gives stronger gradients near zero and weaker ones far away.
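
The table entries away from x = 0 can be regenerated directly from the piecewise definitions. This is a minimal NumPy sketch with α = 0.01 for Leaky ReLU and α = 1.0 for ELU, as in the table. The variable names are illustrative.

```python
import numpy as np

alpha_leaky, alpha_elu = 0.01, 1.0
x = np.array([-3.0, -1.0, 1.0, 3.0])

# Leaky ReLU: slope alpha on the negative side, identity on the positive side.
leaky = np.where(x > 0, x, alpha_leaky * x)
leaky_grad = np.where(x > 0, 1.0, alpha_leaky)

# ELU: exponential curve on the negative side, identity on the positive side.
elu = np.where(x > 0, x, alpha_elu * (np.exp(x) - 1.0))
elu_grad = np.where(x > 0, 1.0, alpha_elu * np.exp(x))

for row in zip(x, leaky, leaky_grad, elu, elu_grad):
    print("x={:+.0f}  leaky={:+.3f}  leaky'={:.3f}  elu={:+.3f}  elu'={:.3f}".format(*row))
```

The Leaky ReLU gradient stays at 0.01 for both negative inputs, while the ELU gradient shrinks from 0.368 at x = -1 to 0.050 at x = -3.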

PReLU — Learning the Slope

Parametric ReLU (PReLU), proposed by He et al. (2015), is Leaky ReLU where α is a learnable parameter. Instead of fixing α = 0.01, the network decides how much negative signal to let through during training.

PReLU(x) = x   if x > 0,    αx   if x ≤ 0    (α learned via backprop)

In their ImageNet experiments, He et al. found that the learned α values varied by layer. Early layers learned larger α (more negative signal preserved), while later layers learned smaller α. The network was automatically tuning the activation function per layer — something hand-tuning could never achieve efficiently.

PReLU added only one parameter per channel (or per layer), so the overhead is negligible. It improved ImageNet top-5 error by ~0.5% over standard ReLU — small in absolute terms, but significant at the frontier of accuracy at the time.

Activation Function Comparison

All three functions on the same axes. Adjust α to see how the negative region changes. Below: gradient heatmaps showing gradient strength across the input range. Further below: the dead neuron counter from Ch2 — compare ReLU vs. Leaky ReLU vs. ELU.

python
import numpy as np
import torch
import torch.nn as nn

# From scratch
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

# PyTorch equivalents
leaky = nn.LeakyReLU(negative_slope=0.01)
elu_layer = nn.ELU(alpha=1.0)
prelu = nn.PReLU(num_parameters=1)  # alpha is learnable!

x = torch.linspace(-3, 3, 7)
print("Leaky:", leaky(x))
print("ELU:  ", elu_layer(x))
print("PReLU:", prelu(x).detach())  # detach: the output tracks the learnable alpha
print("PReLU alpha:", prelu.weight.item())  # initial: 0.25
You might think Leaky ReLU is always better than ReLU since it fixes dead neurons. In practice, the difference is often negligible for well-initialized networks with reasonable learning rates. ReLU's simplicity — one comparison, no multiply for the positive branch — gives it a slight speed advantage on GPUs. The dead neuron problem is real but often overstated. Most networks work fine with regular ReLU unless learning rates are very high or initialization is poor. Leaky ReLU and ELU are your fallback when you do see dying neurons in your training logs — not the default first choice.
How does Leaky ReLU prevent dead neurons?

Chapter 4: GELU — The Transformer's Choice

ReLU makes a binary decision: positive inputs pass, negative inputs die. But what if instead of a hard gate, we used a soft one? What if the probability of passing an input through depended on how large it is?

Inputs of +3 almost certainly pass. Inputs of -3 almost certainly get zeroed. Inputs near 0 get a coin flip. That probabilistic interpretation is GELU — the Gaussian Error Linear Unit.

Proposed by Hendrycks and Gimpel in 2016, GELU became the default activation for BERT, GPT-2, GPT-3, and Vision Transformer (ViT). If you use a transformer from that lineage today, you're almost certainly using GELU somewhere inside it.

The Stochastic Regularization Interpretation

Here's where GELU comes from. Imagine you have an input x to a neuron. Instead of passing it through directly, you multiply it by a random mask — a Bernoulli random variable that's either 0 or 1. If the mask is 1, the input passes. If 0, it's dropped. Sound familiar? That's dropout.

But dropout uses a fixed probability (like 0.1). What if the probability of keeping an input depended on the value of x itself? Specifically: the probability that the input passes is Φ(x), the standard normal CDF, which is the probability that a draw from a standard normal distribution is less than or equal to x.

Large positive x? Almost all of the normal distribution is below you, so Φ(x) ≈ 1 — you pass. Large negative x? Almost none is below you, so Φ(x) ≈ 0 — you're dropped. Near zero? Φ(0) = 0.5 — a coin flip.

The expected value of this stochastic process IS GELU. For input x, the mask is Bernoulli(Φ(x)). The expected output is: E[x · Bernoulli(Φ(x))] = x · Φ(x). That's the entire GELU formula. No curve fitting, no heuristics — just the expected value of a probabilistic gate.
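
The expectation claim can be tested by simulating the gate. The sketch below uses only the standard library. The function `stochastic_gate_mean`, the sample count, and the seed are illustrative choices rather than part of the GELU definition.

```python
import math
import random

def phi(x):
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    return x * phi(x)

def stochastic_gate_mean(x, n_samples=200_000, seed=0):
    # Keep x with probability phi(x), otherwise output 0; average the results.
    rng = random.Random(seed)
    keep_prob = phi(x)
    kept = sum(1 for _ in range(n_samples) if rng.random() < keep_prob)
    return x * kept / n_samples

for x in [-1.0, 0.5, 2.0]:
    print(f"x={x:+.1f}  simulated={stochastic_gate_mean(x):+.4f}  x*Phi(x)={gelu(x):+.4f}")
```

The simulated average lands close to x · Φ(x) for each input, up to Monte Carlo noise.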

The Formula

GELU(x) = x · Φ(x) = x · 0.5 · (1 + erf(x / √2))

Where erf is the Gauss error function. Since erf can be expensive to compute, there's a fast approximation used in practice:

GELU(x) ≈ 0.5x · (1 + tanh(√(2/π) · (x + 0.044715x³)))

The tanh approximation is what frameworks like PyTorch use when you pass approximate='tanh'. The exact version uses erf directly. Both produce nearly identical results: the largest difference is roughly 0.0005.
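
The size of the approximation error can be measured by scanning a grid of inputs. This standard-library sketch assumes a grid from -5 to 5 in steps of 0.01, which is an arbitrary choice. The names `max_err` and `worst_x` are illustrative.

```python
import math

def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Scan x in [-5, 5] and record the largest absolute gap and where it occurs.
grid = [i / 100.0 for i in range(-500, 501)]
max_err, worst_x = max((abs(gelu_exact(x) - gelu_tanh(x)), x) for x in grid)

print(f"max |exact - tanh approx| = {max_err:.6f} at |x| = {abs(worst_x):.2f}")
```

The gap stays well below 0.001 everywhere on this grid, which is why the two forms are interchangeable in practice.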

Hand Calculation: GELU at Five Points

Let's compute GELU by hand at five inputs. We need Φ(x) — the standard normal CDF — which you can look up in a Z-table or compute from erf.

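x = -2: Φ(-2) ≈ 0.02275, so GELU(-2) = -2 × 0.02275 ≈ -0.0455

x = -1: Φ(-1) ≈ 0.1587, so GELU(-1) = -1 × 0.1587 ≈ -0.1587

x = 0: Φ(0) = 0.5, so GELU(0) = 0 × 0.5 = 0

x = 1: Φ(1) ≈ 0.8413, so GELU(1) = 1 × 0.8413 ≈ 0.8413

x = 2: Φ(2) ≈ 0.97725, so GELU(2) = 2 × 0.97725 ≈ 1.9545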

Pattern: For positive inputs, GELU < ReLU (the gate isn't fully open). For negative inputs, GELU ≠ 0 (the gate isn't fully closed). GELU interpolates smoothly between "pass" and "suppress" — there's no sharp corner at zero.

Why GELU Won for Transformers

Three properties made GELU the transformer default:

1. Smooth everywhere. GELU has continuous derivatives of all orders. ReLU has a discontinuous first derivative at x=0. This smoothness means gradients never have sudden jumps, which helps optimizers like Adam maintain stable momentum estimates.

2. Non-monotonic for negative inputs. GELU dips below zero and reaches its minimum near x ≈ -0.75, where GELU(x) ≈ -0.17. This means the function isn't strictly increasing: moving left from zero, the output falls to about -0.17 and then rises back toward zero. This non-monotonicity is often described as a form of built-in regularization.

3. Stochastic regularization. Because GELU can be interpreted as expected dropout, it provides an implicit regularization effect during forward passes. Empirically, BERT trained with GELU converges faster and generalizes better than with ReLU.
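
The location and depth of GELU's dip can be found with a simple grid search. This is a standard-library sketch. The grid range and the 0.001 step are arbitrary choices.

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Search the negative axis for the input that minimizes GELU.
grid = [i / 1000.0 for i in range(-3000, 1)]
x_min = min(grid, key=gelu)

print(f"minimum near x = {x_min:.3f}, GELU(x) = {gelu(x_min):.4f}")
```

The search lands near x = -0.75 with a value of about -0.17. To the left of that point the function rises back toward zero.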

Interactive: GELU vs ReLU

GELU Explorer

Drag the slider to move a probe along the x-axis. Top: GELU vs ReLU curves. Middle: the gate probability Φ(x). Bottom: gradients compared.


Code: GELU from Scratch

python
import torch
import math

# Exact GELU using the error function
def gelu_exact(x):
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

# Fast tanh approximation (what PyTorch computes with approximate='tanh')
def gelu_tanh(x):
    return 0.5 * x * (1.0 + torch.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    ))

# Verify against PyTorch built-in
x = torch.linspace(-3, 3, 7)
print("Exact:  ", gelu_exact(x))
print("Approx: ", gelu_tanh(x))
print("PyTorch:", torch.nn.functional.gelu(x))
# Exact matches the PyTorch default; the tanh approximation differs by less than 0.001
GELU is NOT just "smooth ReLU." It's non-monotonic: it dips below zero and bottoms out near x ≈ -0.75, where GELU(x) ≈ -0.17. ReLU is always non-negative. This non-monotonicity means GELU outputs small negative values for negative inputs, providing a richer gradient signal than ReLU's flat zero. The non-monotonicity isn't a bug; in the regularization reading above, it is part of the design.
What gives GELU its "soft gating" behavior?

Chapter 5: SiLU/Swish — The Self-Gated Activation

In 2017, Google Brain tried something unusual. Instead of designing an activation function by hand, they used neural architecture search — an AI designing AI components. They searched over a space of simple mathematical operations and evaluated each candidate on real tasks.

The winner was remarkably simple: multiply the input by its own sigmoid. They called it Swish.

Swish(x) = x · σ(x) = x / (1 + e^(-x))

That's it. The sigmoid σ(x) acts as a gate: for large positive x, σ(x) ≈ 1 so Swish(x) ≈ x (identity). For large negative x, σ(x) ≈ 0 so Swish(x) ≈ 0 (suppression). Near zero, the sigmoid gives a smooth interpolation.

The β Parameter

The full Swish formula includes a learnable (or fixed) parameter β:

Swish_β(x) = x · σ(βx)

This parameter controls the sharpness of the gate:

Swish interpolates between a linear function and ReLU. At β=0, it's linear (x/2). At β=∞, it's ReLU. At β=1, it's somewhere in between — smooth, non-monotonic, and just right for most tasks.

The Gradient: No Dead Zone

The Swish gradient is more complex than ReLU's but has a critical advantage: it is never stuck at zero over a whole region. It crosses zero only at a single point, the function's minimum near x ≈ -1.28.

Swish'(x) = σ(x) + x · σ(x) · (1 - σ(x)) = σ(x) · (1 + x · (1 - σ(x)))

At x = 0: σ(0) = 0.5, so Swish'(0) = 0.5 · (1 + 0) = 0.5. Compare this to ReLU, whose gradient jumps from 0 to 1 at x=0. Swish has no discontinuity — a smooth transition through every input value.

Even for large negative x (like x = -10), σ(-10) ≈ 0.0000454, which is tiny but nonzero. Swish neurons never fully die — there's always a small gradient signal to nudge them back to life.

Hand Calculation: Swish at Five Points

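Using β = 1 (standard SiLU). We need σ(x) = 1/(1 + e^(-x)).

x = -2: σ(-2) ≈ 0.1192, so Swish(-2) = -2 × 0.1192 ≈ -0.238

x = -1: σ(-1) ≈ 0.2689, so Swish(-1) = -1 × 0.2689 ≈ -0.269

x = 0: σ(0) = 0.5, so Swish(0) = 0 × 0.5 = 0

x = 1: σ(1) ≈ 0.7311, so Swish(1) = 1 × 0.7311 ≈ 0.731

x = 2: σ(2) ≈ 0.8808, so Swish(2) = 2 × 0.8808 ≈ 1.762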

The minimum of Swish occurs at x ≈ -1.28, where Swish(x) ≈ -0.278. This is a deeper dip than GELU's minimum of -0.17. Both are non-monotonic, but Swish allows larger negative outputs.
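
The minimum and the gradient formula can be checked together: at the minimum the gradient should be approximately zero. This is a standard-library sketch for β = 1. The grid resolution is an arbitrary choice.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

def swish_grad(x):
    # Gradient for beta = 1: sigma(x) * (1 + x * (1 - sigma(x)))
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

# Locate the minimum on the negative axis with a grid search.
grid = [i / 1000.0 for i in range(-5000, 1)]
x_min = min(grid, key=swish)

print(f"minimum near x = {x_min:.3f}, Swish(x) = {swish(x_min):.4f}")
print(f"gradient at the minimum = {swish_grad(x_min):.5f}")
print(f"gradient at x = -10     = {swish_grad(-10.0):.6f}")
```

The gradient passes through zero only at the minimum. Farther left it is small and negative rather than exactly zero, so a unit there still receives some update signal.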

SiLU — The Same Thing, Different Name

SiLU (Sigmoid Linear Unit) is exactly Swish with β = 1. The name was proposed independently by Elfwing et al. in 2018. PyTorch uses torch.nn.SiLU(). You'll see both names — they're the same function.

Where SiLU/Swish appears in the wild:

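| Model | Where | Year |
| --- | --- | --- |
| EfficientNet | All conv layers | 2019 |
| LLaMA / LLaMA 2 | FFN gate (via SwiGLU) | 2023 |
| Mistral / Mixtral | FFN gate (via SwiGLU) | 2023 |
| Stable Diffusion | U-Net conv blocks | 2022 |
| Gemma | FFN gate (via GeGLU, which gates with GELU rather than SiLU) | 2024 |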

Interactive: Swish with Adjustable β

Swish β Explorer

Slide β from 0 (linear) to 5 (near-ReLU). Watch the curve morph. Below: gradient comparison with ReLU and GELU.


Code: SiLU/Swish from Scratch

python
import torch

# SiLU / Swish (beta=1)
def silu(x):
    return x * torch.sigmoid(x)

# General Swish with adjustable beta
def swish(x, beta=1.0):
    return x * torch.sigmoid(beta * x)

# Gradient of SiLU (for understanding)
def silu_grad(x):
    s = torch.sigmoid(x)
    return s * (1 + x * (1 - s))

# Verify against PyTorch built-in
x = torch.linspace(-3, 3, 7)
print("Ours:   ", silu(x))
print("PyTorch:", torch.nn.functional.silu(x))
# Identical — SiLU IS Swish with beta=1

# Beta sweep: watch Swish morph from linear to ReLU
for beta in [0, 0.5, 1, 2, 5, 20]:
    y = swish(torch.tensor(1.0), beta)
    print(f"beta={beta:4}  Swish(1)={y:.4f}")
# beta=0 → 0.5, beta=1 → 0.731, beta=20 → 1.000 (≈ReLU)
Swish and GELU look very similar and have nearly identical performance. The practical difference is negligible for most tasks. GELU is standard in the BERT, GPT-2/3, and ViT generation of transformers because it was adopted there first. SiLU/Swish is standard in vision models such as EfficientNet and shows up in LLM FFNs via SwiGLU. Don't agonize over the choice; either works.
What happens to Swish as β increases toward infinity?

Chapter 6: Gated Linear Units — SwiGLU & GeGLU

Every transformer has two main components per layer: attention and a feed-forward network (FFN). We've been talking about what activation goes inside the FFN. But what if the FFN's entire structure changed?

What if instead of one path through an activation, you had two paths — one for content and one for deciding what to keep? That's the Gated Linear Unit (GLU).

The Standard FFN — One Path

In the original transformer (Vaswani 2017), each layer's FFN is:

Input x: [batch, seq, d_model]
→ W1 · x + b1: project up, d_model → d_ff (usually 4× d_model)
→ ReLU / GELU: apply activation element-wise
→ W2 · h + b2: project down, d_ff → d_model
→ Output: [batch, seq, d_model]

Two weight matrices. One activation. One path. Simple. The activation (ReLU or GELU) decides which features to suppress. But it makes that decision based only on the magnitude of each value independently.

The GLU — Two Paths

A Gated Linear Unit splits the FFN into two parallel projections:

Input x: [batch, seq, d_model]
→ split into two paths:
   Gate path: W_gate · x, d_model → d_ff
   Content path: W_up · x, d_model → d_ff
→ apply the activation to the gate path, then multiply element-wise: σ(W_gate · x) ⊙ (W_up · x), so the gate controls what content passes
→ W_down · h: project down, d_ff → d_model
→ Output: [batch, seq, d_model]

The key insight: the gate path and the content path are different linear projections of the same input. The gate learns which features to keep. The content learns what values to produce. The element-wise product lets the network learn feature selection — a much richer operation than applying an activation function element-wise.

Think of it like an audio mixing board. The content path produces all the audio tracks. The gate path is a row of faders — each one independently controlling the volume of one track. The activation function on the gate (SiLU, GELU, sigmoid) determines how the faders behave. Without gating, each track can only be "on" or "off" based on its own volume. With gating, a quiet track can be amplified and a loud track can be muted — based on the full context of the input.

GLU Variants

The activation function on the gate path defines the GLU variant:

| Variant | Gate Activation | Formula | Used In |
| --- | --- | --- | --- |
| GLU | σ(x) (sigmoid) | σ(xW_g) ⊙ xW_up | Original (Dauphin 2017) |
| SwiGLU | SiLU/Swish | SiLU(xW_g) ⊙ xW_up | LLaMA, Mistral, PaLM |
| GeGLU | GELU | GELU(xW_g) ⊙ xW_up | Gemma, some T5 variants |
| ReGLU | ReLU | ReLU(xW_g) ⊙ xW_up | Experimental |

SwiGLU (Shazeer, 2020) emerged as the winner. In experiments across language modeling benchmarks, the gated variants outperformed the non-gated FFNs, with SwiGLU and GeGLU at or near the top. Google adopted SwiGLU for PaLM. Meta adopted it for LLaMA. Today most major open-weight LLMs use SwiGLU or a close GLU relative.

The 2/3 Rule — Free Gating

You might worry that GLU adds parameters. After all, it has three weight matrices (Wgate, Wup, Wdown) instead of two (W1, W2). Let's count.

Standard FFN: W1 is d_model × d_ff and W2 is d_ff × d_model, for 2 × d_model × d_ff weights in total (ignoring biases).

GLU FFN with hidden dimension d': W_gate and W_up are each d_model × d', and W_down is d' × d_model, for 3 × d_model × d' weights in total.

To match parameter counts: 3 × d_model × d' = 2 × d_model × d_ff. Solve: d' = 2/3 × d_ff.

The 2/3 rule: Set the GLU hidden dimension to 2/3 of the standard FFN hidden dimension. You get the gating mechanism for free: same total parameter count, better expressiveness. In practice, LLaMA uses d_ff = (2/3) × 4 × d_model ≈ 2.67 × d_model, rounded up to a multiple of 256.

Hand Calculation: SwiGLU Step by Step

Let's trace a 4D input through both a standard FFN and a SwiGLU FFN. d_model = 4.

Input: x = [0.5, -1.0, 0.3, 0.8]

Standard FFN (d_ff = 8, so W1 is 4×8):

Suppose after W1, the 8D hidden vector is: h = [1.2, -0.8, 0.5, -1.5, 2.1, 0.3, -0.1, 0.9]

After ReLU: [1.2, 0, 0.5, 0, 2.1, 0.3, 0, 0.9]. Three features killed outright — the ReLU decided independently for each.

SwiGLU FFN (d' = 5 to match params — since 3×4×5 = 60 ≈ 2×4×8 = 64):

Gate path (Wgate · x): [0.8, -1.2, 0.4, -0.3, 1.5]

Content path (Wup · x): [1.1, 0.7, -0.9, 0.5, 0.2]

Apply SiLU to gate path: SiLU(z) = z · σ(z) gives [0.552, -0.278, 0.239, -0.128, 1.226]

Element-wise multiply (gate ⊙ content): [0.552 × 1.1, -0.278 × 0.7, 0.239 × (-0.9), -0.128 × 0.5, 1.226 × 0.2]

Hidden after gating: [0.607, -0.194, -0.216, -0.064, 0.245] (computed from the unrounded gate values). Then W_down projects back to d_model = 4.

Notice: no features were killed. The gate modulated each value based on learned context, rather than making a binary pass/kill decision. Feature 4 has a small gate, so its content is mostly suppressed. Feature 2 has a negative gate that flips the sign of its content. This is richer than ReLU could ever be.
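
The gating step of this walkthrough can be reproduced in NumPy. The sketch starts from the gate and content vectors given above, since the weight matrices that would produce them are not specified. It contrasts the result with ReLU on the standard FFN's hidden vector.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))  # z * sigmoid(z)

# Vectors from the walkthrough (outputs of W_gate · x and W_up · x).
gate_pre = np.array([0.8, -1.2, 0.4, -0.3, 1.5])
content = np.array([1.1, 0.7, -0.9, 0.5, 0.2])

gate = silu(gate_pre)
hidden = gate * content  # element-wise gating

# Standard FFN hidden vector from the walkthrough, passed through ReLU.
h = np.array([1.2, -0.8, 0.5, -1.5, 2.1, 0.3, -0.1, 0.9])
relu_hidden = np.maximum(0.0, h)

print("SiLU(gate):  ", np.round(gate, 3))
print("gated hidden:", np.round(hidden, 3))
print("zeros after ReLU:  ", int((relu_hidden == 0).sum()))
print("zeros after SwiGLU:", int((hidden == 0).sum()))
```

ReLU zeroes three of the eight features, while the gated path keeps all five nonzero and only rescales them.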

SHOWCASE: FFN Architecture Arena

FFN Architecture Comparison

Toggle between architectures to see data flow. Click "Compute!" to animate a sample input through the network. Adjust dimensions to see how parameter counts change.


Code: SwiGLU FFN from Scratch

python
import torch
import torch.nn as nn

class StandardFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        # 2/3 rule: hidden_dim = 2/3 * d_ff
        hidden = int(2 * d_ff / 3)
        hidden = hidden + (256 - hidden % 256) % 256  # round up to 256
        self.w_gate = nn.Linear(d_model, hidden, bias=False)
        self.w_up   = nn.Linear(d_model, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x):
        gate = nn.functional.silu(self.w_gate(x))  # SiLU = Swish
        up   = self.w_up(x)
        return self.w_down(gate * up)             # element-wise gating

# Parameter count comparison
d, ff = 4096, 4 * 4096
std = StandardFFN(d, ff)
glu = SwiGLUFFN(d, ff)
std_p = sum(p.numel() for p in std.parameters())
glu_p = sum(p.numel() for p in glu.parameters())
print(f"Standard FFN: {std_p:,} params")  # 134,217,728
print(f"SwiGLU FFN:   {glu_p:,} params")  # 135,266,304 (within 1%)
Why SwiGLU won. Shazeer (2020) compared several GLU variants (GLU, Bilinear, ReGLU, GEGLU, SwiGLU) against standard ReLU, GELU, and Swish FFNs on language modeling perplexity. The gated variants came out ahead, with GEGLU and SwiGLU giving the best perplexities. The reason isn't just the activation choice; it's the gating structure. Having two separate projections lets the network learn which features to amplify and which to suppress, based on the full input context. A standard FFN can only suppress a hidden feature based on that feature's own pre-activation value. That's the real power of GLU: learned feature selection at every layer.
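The parameter matching behind the 2/3 rule can be checked with plain arithmetic, without allocating any tensors. The sketch below mirrors the rounding used in the SwiGLUFFN class above; the helper names and the `multiple_of` argument are introduced here for illustration only.

```python
def swiglu_hidden_dim(d_ff, multiple_of=256):
    """Apply the 2/3 rule, then round up to a multiple of `multiple_of`."""
    hidden = int(2 * d_ff / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

def standard_ffn_params(d_model, d_ff):
    # Two bias-free matrices: d_model x d_ff and d_ff x d_model.
    return 2 * d_model * d_ff

def swiglu_ffn_params(d_model, d_ff, multiple_of=256):
    # Three bias-free matrices (gate, up, down) at the reduced hidden width.
    hidden = swiglu_hidden_dim(d_ff, multiple_of)
    return 3 * d_model * hidden

d_model, d_ff = 4096, 4 * 4096
hidden = swiglu_hidden_dim(d_ff)
std_params = standard_ffn_params(d_model, d_ff)
glu_params = swiglu_ffn_params(d_model, d_ff)

print(f"SwiGLU hidden width: {hidden}")         # 11008
print(f"Standard FFN: {std_params:,} params")   # 134,217,728
print(f"SwiGLU FFN:   {glu_params:,} params")   # 135,266,304
print(f"Ratio: {glu_params / std_params:.4f}")  # 1.0078
```

With the toy sizes from the hand calculation (d_model = 4, d_ff = 8) and `multiple_of=1`, the same rule gives a hidden width of 5, that is 60 parameters against 64.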

Chapter 7: Mish & The Unifying Pattern

Notice a pattern? GELU multiplies the input by the standard normal CDF evaluated at that input. Swish multiplies the input by its sigmoid. Each of these smooth activations follows the same template: x times some smooth gate function. Mish continues this pattern, and understanding the template is more valuable than memorizing individual formulas.

The x · gate(x) Framework

Every modern activation function can be written as:

f(x) = x · g(x)

Where g: ℝ → [0, 1] is a smooth function that approaches 1 for large positive x and 0 for large negative x. The factor x makes the function behave like the identity wherever the gate is close to 1, that is, for large positive inputs. The gate suppresses large negative inputs by driving the output toward 0. Different gate functions give different activations:

| Activation | Gate g(x) | Formula |
|---|---|---|
| SiLU/Swish | σ(x) = 1/(1+e^(-x)) | x · σ(x) |
| GELU | Φ(x) = 0.5(1+erf(x/√2)) | x · Φ(x) |
| Mish | tanh(softplus(x)) | x · tanh(ln(1+e^x)) |

All three gates have the same shape: a smooth S-curve from 0 to 1. They differ in exactly where they transition and how quickly, but the overall behavior is nearly identical.

Once you see the template, you understand all modern activations. Stop memorizing individual formulas. Instead, remember: f(x) = x · (smooth gate from 0 to 1). The only question is which gate — and in practice, the choice barely matters.

Mish — The Third Member

Proposed by Diganta Misra in 2019, Mish uses a gate built from two familiar pieces: softplus and tanh.

Mish(x) = x · tanh(softplus(x)) = x · tanh(ln(1 + e^x))

Let's unpack this from the inside out:

softplus(x) = ln(1 + e^x) is a smooth approximation of ReLU. For large positive x, softplus(x) ≈ x. For large negative x, softplus(x) ≈ 0. At x=0, softplus(0) = ln(2) ≈ 0.693.

tanh squashes its input to the range (-1, 1). Since softplus is always positive, tanh(softplus(x)) always lies in (0, 1), which makes it a valid gate. For large positive x: softplus(x) ≈ x and tanh(x) ≈ 1, so the gate ≈ 1. For large negative x: softplus(x) ≈ 0 and tanh(0) = 0, so the gate ≈ 0. Exactly the behavior we need.

Hand Calculation: Mish Step by Step

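x = -2: softplus(-2) = ln(1 + e^(-2)) ≈ 0.1269, tanh(0.1269) ≈ 0.1263, Mish(-2) ≈ -2 × 0.1263 ≈ -0.253

x = -1: softplus(-1) = ln(1 + e^(-1)) ≈ 0.3133, tanh(0.3133) ≈ 0.3034, Mish(-1) ≈ -1 × 0.3034 ≈ -0.303

x = 0: softplus(0) = ln(2) ≈ 0.6931, tanh(0.6931) = 0.6, Mish(0) = 0 × 0.6 = 0

x = 1: softplus(1) = ln(1 + e) ≈ 1.3133, tanh(1.3133) ≈ 0.8651, Mish(1) ≈ 1 × 0.8651 ≈ 0.865

x = 2: softplus(2) = ln(1 + e^2) ≈ 2.1269, tanh(2.1269) ≈ 0.9720, Mish(2) ≈ 2 × 0.9720 ≈ 1.944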

Compare the three activations at the same inputs:

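| x | GELU | SiLU | Mish | ReLU |
|---|---|---|---|---|
| -2 | -0.046 | -0.238 | -0.253 | 0 |
| -1 | -0.159 | -0.269 | -0.303 | 0 |
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0.841 | 0.731 | 0.865 | 1 |
| 2 | 1.954 | 1.762 | 1.944 | 2 |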

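The differences are small in absolute terms, at most about 0.2 anywhere in this table. They matter most for negative inputs, where the outputs themselves are small, so the gap is large relative to the value. For large positive inputs, all three converge toward ReLU. Mish has the deepest negative dip of the three, slightly below SiLU, and lands very close to GELU for positive inputs.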
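The comparison table can be regenerated with the standard library alone. The sketch below uses the exact erf-based GELU and a naive softplus; the naive form is fine for these small inputs but would overflow for very large x, which a production implementation would guard against.

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    # x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def softplus(x):
    # Naive form; math.exp overflows for very large x.
    return math.log1p(math.exp(x))

def mish(x):
    return x * math.tanh(softplus(x))

def relu(x):
    return max(0.0, x)

rows = [(x, gelu(x), silu(x), mish(x), relu(x))
        for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

print(f"{'x':>4} {'GELU':>8} {'SiLU':>8} {'Mish':>8} {'ReLU':>8}")
for x, g, s, m, r in rows:
    print(f"{x:>4.0f} {g:>8.3f} {s:>8.3f} {m:>8.3f} {r:>8.3f}")
```

Printing more decimals (for example `.4f`) makes it easier to see where rounding in a hand calculation can shift the last digit.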

Where Mish Shows Up

Mish gained adoption primarily in computer vision, most visibly in YOLOv4.

In NLP/LLMs, Mish never gained traction — GELU and SiLU were already established. But Mish proved that the x·gate(x) template is robust: you can swap in different gates and get similar performance.

Interactive: All Modern Activations

Activation Comparison Arena

Toggle each activation on/off. Top: function curves. Bottom: difference from ReLU — the deviations are tiny and concentrated near zero.

Code: The Unified Template

python
import torch
import torch.nn.functional as F
import math

# The x * gate(x) template
def gated_activation(x, gate_fn):
    """All modern activations: f(x) = x * gate(x)"""
    return x * gate_fn(x)

# Different gates
def sigmoid_gate(x):    return torch.sigmoid(x)       # → SiLU
def phi_gate(x):        return 0.5 * (1 + torch.erf(x / math.sqrt(2)))  # → GELU
def mish_gate(x):       return torch.tanh(F.softplus(x))  # → Mish

# All three from the same template
x = torch.linspace(-3, 3, 100)
silu_out = gated_activation(x, sigmoid_gate)
gelu_out = gated_activation(x, phi_gate)
mish_out = gated_activation(x, mish_gate)

# How different are they really?
print("Max |GELU - SiLU|:", (gelu_out - silu_out).abs().max().item())  # ~0.19
print("Max |GELU - Mish|:", (gelu_out - mish_out).abs().max().item())  # ~0.21
print("Max |SiLU - Mish|:", (silu_out - mish_out).abs().max().item())  # ~0.18
# Gaps of at most ~0.2 on [-3, 3]: the three curves are close, though not identical
The differences between GELU, SiLU, and Mish are tiny. On most benchmarks, they're within 0.1-0.3% of each other. Don't chase activation function benchmarks — the choice of GELU vs SiLU vs Mish matters far less than learning rate, batch size, or model architecture. Pick what your framework or model family uses and move on.
What unifying pattern do GELU, SiLU/Swish, and Mish all share?

Chapter 8: The Arena

We've studied each activation function in isolation. Now let's put them all on the same axes and watch them compete. The simulation below plots every function we've covered — with its gradient — so you can see at a glance how they differ in the regions that matter most: near zero, deep negative, and far positive.

The key tradeoffs become visible immediately. Sigmoid and tanh saturate on both sides. ReLU is dead on the left but perfectly linear on the right. GELU and SiLU gently curve through zero, allowing a small negative region. Mish is the smoothest of all. And SwiGLU isn't shown directly because it's a gated mechanism applied to two streams — you saw it in Chapter 6.

What to Look For

Activation Function Arena

All functions on the same axes. Toggle each one on/off. Top panel: function values. Bottom panel: gradients (derivatives). Drag the x-marker to read exact values.


The Gradient Story

Look at the bottom panel (gradients) with all functions enabled. The picture tells the entire story of activation function evolution:

  1. Sigmoid (1986): gradient peaks at 0.25 and vanishes on both sides. A ceiling that limits depth.
  2. Tanh (1991): gradient peaks at 1.0 but still vanishes on both sides. Better, but deep networks still struggle.
  3. ReLU (2012): gradient is exactly 1 for x > 0, exactly 0 for x < 0. Binary: alive or dead. Enabled networks of 100+ layers but kills neurons.
  4. Leaky ReLU (2013): gradient is 1 for x > 0, 0.01 for x < 0. No dead neurons, but the negative gradient is tiny.
  5. GELU (2016): gradient smoothly transitions through zero, reaching ~1 for large positive x. Used in BERT, GPT-2, ViT.
  6. SiLU/Swish (2017): nearly identical to GELU in practice. The gradient overshoots 1.0 for x above roughly 1.3, peaking at about 1.10 near x ≈ 2.4. Used in EfficientNet.
  7. Mish (2019): the smoothest gradient curve. Barely distinguishable from SiLU in practice, but its second derivative is smoother.

The trend is clear: each generation produced smoother, more gradient-friendly activations. The field converged on functions that are smooth at zero, have gradient near 1 for positive inputs, and allow a small controlled negative signal.
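The gradient claims in the list above can be checked numerically. The following sketch evaluates the closed-form derivatives of the smooth activations on a grid over [-6, 6] and reports where each gradient peaks; the grid range and step are arbitrary choices made for this illustration.

```python
import math

SQRT_2 = math.sqrt(2.0)
SQRT_2PI = math.sqrt(2.0 * math.pi)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def d_gelu(x):
    cdf = 0.5 * (1.0 + math.erf(x / SQRT_2))   # Phi(x)
    pdf = math.exp(-0.5 * x * x) / SQRT_2PI    # phi(x)
    return cdf + x * pdf

def d_silu(x):
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

def d_mish(x):
    t = math.tanh(math.log1p(math.exp(x)))     # gate = tanh(softplus(x))
    # Product rule; the derivative of softplus is sigmoid.
    return t + x * (1.0 - t * t) * sigmoid(x)

grid = [i / 100.0 for i in range(-600, 601)]   # -6.00 ... 6.00
peaks = {}
for name, grad in [("sigmoid", d_sigmoid), ("tanh", d_tanh),
                   ("GELU", d_gelu), ("SiLU", d_silu), ("Mish", d_mish)]:
    x_star = max(grid, key=grad)
    peaks[name] = (x_star, grad(x_star))
    print(f"{name:8s} peak gradient {grad(x_star):.4f} at x = {x_star:+.2f}")
```

Sigmoid tops out at 0.25 and tanh at 1.0, both at x = 0. The gated activations all peak slightly above 1 for moderate positive inputs, with GELU reaching about 1.13 near x = √2 and SiLU about 1.10 near x = 2.4.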

In modern LLMs with published architectures (LLaMA, PaLM, Mistral), SwiGLU has become the default. It combines the smooth gradient of SiLU with the adaptive gating of GLU. The Arena shows individual activations, but the frontier has moved to gated activations where two streams interact. SwiGLU won by combining good gradient flow with learned input-dependent gating.
Looking at the gradient panel: which activation has the highest maximum gradient for any input value?

Chapter 9: Cheat Sheet & Connections

Ten chapters, eight activation functions, one gating mechanism. Here's everything compressed into a single reference table, followed by the decision tree for choosing the right activation and links to where these ideas lead next.

The Complete Reference

| Name | Formula | Gradient | Key Property | Used In |
|---|---|---|---|---|
| Sigmoid | 1/(1+e^(-x)) | σ(1-σ), max 0.25 | Saturates both sides | Output gates, LSTM gates |
| Tanh | (e^x - e^(-x))/(e^x + e^(-x)) | 1 - tanh²(x), max 1.0 | Zero-centered, saturates | LSTM state, older RNNs |
| ReLU | max(0, x) | 1 if x>0, else 0 | Dead neurons, but simple | CNNs, default choice pre-2018 |
| Leaky ReLU | x if x>0, αx if x≤0 | 1 if x>0, α if x≤0 | No dead neurons | GANs, when ReLU dies |
| ELU | x if x>0, α(e^x - 1) if x≤0 | 1 if x>0, αe^x if x≤0 | Smooth, pushes mean toward 0 | Niche use, research |
| GELU | x · Φ(x) | Φ(x) + x · φ(x) | Smooth, probabilistic gate | BERT, GPT-2, ViT |
| SiLU/Swish | x · σ(x) | σ(x)(1 + x(1-σ(x))) | Non-monotonic, gradient > 1 | EfficientNet, Stable Diffusion |
| Mish | x · tanh(softplus(x)) | Complex (see Ch 7) | Smoothest of all | YOLOv4, niche use |
| SwiGLU | SiLU(xW) ⊙ (xV) | Gated: gradient depends on both streams | Learned gating, 50% more params at equal hidden width | LLaMA, PaLM, Mistral |

The Decision Tree

What are you building?
Architecture determines the activation function, not the other way around.
LLM / Transformer?
SwiGLU (LLaMA/Mistral-style) or GELU (BERT/GPT-2-style). SwiGLU is the modern default.
CNN / Image model?
ReLU for simplicity, SiLU/Swish for best accuracy (EfficientNet proved this). GELU in Vision Transformers.
GAN?
Leaky ReLU in the discriminator (prevents dead neurons with adversarial training). ReLU or Leaky in the generator.
RNN / LSTM / GRU?
Tanh for cell state, sigmoid for gates. These are baked into the architecture — don't change them.
Output layer?
Sigmoid for binary/multi-label classification. Softmax for multi-class. None (linear) for regression.
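The output-layer branch of the decision tree can be illustrated with a small standard-library sketch. The logits are made-up numbers chosen for the example, and the softmax subtracts the maximum logit before exponentiating, a common trick for numerical stability.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)                          # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores from a final linear layer with three outputs.
logits = [2.0, 0.5, -1.0]

# Multi-class: classes are mutually exclusive, probabilities sum to 1.
multi_class = softmax(logits)

# Multi-label: each label is an independent yes/no, no sum constraint.
multi_label = [sigmoid(z) for z in logits]

# Regression: no activation, the raw linear output is the prediction.
regression = logits[0]

print("softmax:", [round(p, 3) for p in multi_class], "sum =", round(sum(multi_class), 3))
print("sigmoid:", [round(p, 3) for p in multi_label], "sum =", round(sum(multi_label), 3))
print("linear: ", regression)
```

The softmax probabilities compete with each other, while the sigmoid outputs can all be high or all be low at the same time, which is why the two suit different label structures.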

Where to Go Next

Loss Functions — activations shape the forward pass; loss functions shape what the network learns. Loss Functions covers MSE, cross-entropy, contrastive losses, and when to use each.

Normalization: BatchNorm, LayerNorm, and RMSNorm work hand-in-hand with activations. They rescale (and, for BatchNorm and LayerNorm, re-center) inputs before the activation, helping prevent the saturation that killed sigmoid networks. Normalization derives each technique from scratch.

Optimizers — Adam, AdamW, and learning rate schedules determine how the gradients (shaped by activations) become weight updates. Optimizers covers the full landscape.

Transformer — the architecture where SwiGLU lives. The feed-forward network in every transformer block uses an activation function. The Transformer lesson shows the complete architecture.

Backpropagation — we talked about gradient chains through activations. Backpropagation shows the full chain rule through every layer type, not just activations.

The meta-lesson: activation functions evolved from biological analogy (sigmoid) to mathematical pragmatism (ReLU) to empirical optimization (GELU, SiLU) to learned gating (SwiGLU). Each step made gradients flow better, enabling deeper networks. The next frontier isn't a new activation — it's architectures that make the activation choice matter less (residual connections, normalization, attention).
You're building a modern LLM from scratch. Which activation should you use in the feed-forward layers?