The nonlinearities that make deep learning deep — from the sigmoid that started it all to the SwiGLU inside every modern LLM.
Stack ten linear layers. The result? Mathematically identical to a single linear layer. No matter how deep you go, you can only model straight lines, flat planes, linear boundaries. A 100-layer linear network has the same representational power as a 1-layer one. That's why activation functions exist — they inject the nonlinearity that lets deep networks model curves, corners, and complexity.
This isn't obvious until you see it fail. Take a classification problem where the classes aren't separable by a straight line — say, two interlocking spirals or concentric circles. A linear model, no matter how many layers, draws a single straight boundary. It fails catastrophically. Add one nonlinear activation function between each layer, and suddenly the network can bend, twist, and carve out curved decision regions.
But "just add nonlinearity" hides a trap. The wrong nonlinearity kills your network in a different way — by crushing gradients to zero so weights stop updating. This chapter shows both failure modes: the futility of depth without nonlinearity, and the danger of choosing poorly.
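A single linear layer computes y = Wx + b. Stack two:

$$y_1 = W_1 x + b_1, \qquad y_2 = W_2 y_1 + b_2$$

Substitute $y_1$ into the second equation:

$$y_2 = W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2)$$

That's just $W_{\text{eff}} \cdot x + b_{\text{eff}}$, with $W_{\text{eff}} = W_2 W_1$ and $b_{\text{eff}} = W_2 b_1 + b_2$: a single linear transformation. Two layers collapsed into one. Three layers? Same thing. A hundred layers? Still one effective matrix times one effective bias.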
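Layer 1 weights: $W_1$ = [[1, 2], [3, 4]]

Layer 2 weights: $W_2$ = [[0.5, -1], [1, 0.5]]

Compute $W_{\text{eff}} = W_2 \cdot W_1$ (biases set to zero for simplicity):

$$W_{\text{eff}} = \begin{bmatrix} 0.5\cdot 1 + (-1)\cdot 3 & 0.5\cdot 2 + (-1)\cdot 4 \\ 1\cdot 1 + 0.5\cdot 3 & 1\cdot 2 + 0.5\cdot 4 \end{bmatrix} = \begin{bmatrix} -2.5 & -3.0 \\ 2.5 & 4.0 \end{bmatrix}$$

So $W_{\text{eff}}$ = [[-2.5, -3.0], [2.5, 4.0]]. Feed in x = [1, 0]:

Two layers: $W_1 x$ = [1, 3], then $W_2$ · [1, 3] = [0.5 - 3, 1 + 1.5] = [-2.5, 2.5]. One layer: $W_{\text{eff}}\, x$ = [-2.5, 2.5].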
Identical. Two layers, one layer — same output. Depth bought us nothing. Now imagine stacking 10, 50, or 100 of these. It's still one matrix. The network has no more expressive power than a single layer, regardless of depth.
Insert a nonlinear function σ between layers:

$$y_2 = W_2\,\sigma(W_1 x + b_1) + b_2$$

Now you can't factor out a single $W_{\text{eff}}$. The σ breaks the linearity. Each additional layer adds a new "fold" or "bend" to the function the network can represent. With enough hidden units and a suitable nonlinearity, a neural network can approximate any continuous function on a bounded domain to arbitrary accuracy. This is the universal approximation theorem.
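As a quick numerical check of the collapse argument, the sketch below reuses the example weights from above and compares the two-layer stack against the single effective matrix, then repeats the forward pass with ReLU in between. The input [1, -1] and the function names are illustrative choices, not part of the original example.

```python
import numpy as np

# Weights from the worked example (no biases)
W1 = np.array([[1.0, 2.0], [3.0, 4.0]])
W2 = np.array([[0.5, -1.0], [1.0, 0.5]])
W_eff = W2 @ W1  # the single effective matrix


def linear_stack(x):
    """Two linear layers with nothing in between."""
    return W2 @ (W1 @ x)


def relu_stack(x):
    """Same two layers, with ReLU applied after the first."""
    return W2 @ np.maximum(0.0, W1 @ x)


x = np.array([1.0, -1.0])  # illustrative input that makes layer 1 go negative
linear_out = linear_stack(x)
collapsed_out = W_eff @ x
relu_out = relu_stack(x)

print("two linear layers:", linear_out)
print("one collapsed layer:", collapsed_out)
print("with ReLU between:", relu_out)
```

The first two outputs agree for any input, while the ReLU version differs as soon as a hidden value goes negative, which is exactly what prevents the collapse.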
The simulation below shows this dramatically. On the left, a linear network tries to classify points in two concentric circles. It can only draw a straight line. On the right, add ReLU between layers, and the decision boundary curves to fit the data.
Left: linear network (no activations). Right: same architecture with ReLU. Toggle nonlinearity and adjust depth.
The difference is stark. With linearity, adding layers from 1 to 8 changes nothing — the boundary stays straight. With ReLU, more layers mean more expressive power: the boundary wraps tighter around the data.
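```python
import torch
import torch.nn as nn

# 10-layer linear network (no activations between layers)
linear_net = nn.Sequential(*[nn.Linear(2, 2) for _ in range(10)])

# Collapse all layers into one effective matrix and one effective bias
W_eff = torch.eye(2)
b_eff = torch.zeros(2)
with torch.no_grad():
    for layer in linear_net:
        W_eff = layer.weight @ W_eff
        b_eff = layer.weight @ b_eff + layer.bias

    # The single affine map should reproduce the 10-layer output
    x = torch.randn(5, 2)
    same = torch.allclose(linear_net(x), x @ W_eff.T + b_eff, atol=1e-5)

# W_eff is a SINGLE 2×2 matrix: 10 layers = 1 layer
print("10 layers collapsed to:", W_eff.shape)  # torch.Size([2, 2])
print("Matches the full network:", same)
```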
The first activation functions were inspired by biology. A neuron either fires or it doesn't, and smoothing that step into a curve from 0 to 1 gives sigmoid. Its centered cousin, tanh, maps to (-1, 1). Both were the backbone of neural networks from the 1980s to ~2012. Both have a fatal flaw.
To understand that flaw, we need to derive these functions from scratch, compute their gradients by hand, and watch what happens when you chain those gradients through a deep network.
We want a smooth function that maps any real number to the range (0, 1). It should be near 0 for very negative inputs, near 1 for very positive inputs, and transition smoothly around zero. The logistic function does exactly this:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Why this particular form? The exponential $e^{-x}$ is huge when x is very negative (pushing the denominator up, making σ near 0) and tiny when x is very positive (denominator ≈ 1, making σ near 1). The transition happens around x = 0, where $e^{0} = 1$, so σ(0) = 1/(1+1) = 0.5.

The derivative has an elegant form. Starting from $\sigma(x) = (1 + e^{-x})^{-1}$ and applying the chain rule:

$$\sigma'(x) = -(1 + e^{-x})^{-2} \cdot (-e^{-x}) = \frac{e^{-x}}{(1 + e^{-x})^{2}} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$
This is beautiful but dangerous. The derivative is the product of σ(x) and its complement. The maximum occurs when both factors are equal — at x = 0, where σ(0) = 0.5, giving σ'(0) = 0.5 × 0.5 = 0.25.
Read that again: the maximum possible gradient of sigmoid is 0.25. Not 1. Not 0.5. A quarter. And it only gets worse from there.
| x | $e^{-x}$ | σ(x) | σ'(x) = σ(1-σ) | Status |
|---|---|---|---|---|
| -3 | 20.09 | 1/(1+20.09) = 0.047 | 0.047 × 0.953 = 0.045 | Saturated low |
| -1 | 2.718 | 1/(1+2.718) = 0.269 | 0.269 × 0.731 = 0.197 | Weak gradient |
| 0 | 1.000 | 1/(1+1) = 0.500 | 0.500 × 0.500 = 0.250 | Maximum gradient |
| 1 | 0.368 | 1/(1+0.368) = 0.731 | 0.731 × 0.269 = 0.197 | Weak gradient |
| 3 | 0.050 | 1/(1+0.050) = 0.953 | 0.953 × 0.047 = 0.045 | Saturated high |
At x = ±3, the gradient is already 5.5× smaller than the maximum. At x = 5: σ(5) ≈ 0.9933, so σ'(5) ≈ 0.9933 × 0.0067 ≈ 0.0066. That's roughly 38× smaller than the maximum.
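Sigmoid outputs are always positive (between 0 and 1). This means the next layer always receives positive inputs, so the gradients for all weights feeding a given neuron share the same sign, which can cause zig-zagging during gradient descent. Tanh fixes this by centering the output around zero:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$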
The relationship between tanh and sigmoid is direct: tanh(x) = 2σ(2x) - 1. Tanh maps to (-1, 1) instead of (0, 1). Same S-shape, but centered at zero.
The derivative:

$$\tanh'(x) = 1 - \tanh^{2}(x)$$

At x = 0: tanh(0) = 0, so tanh'(0) = 1 - 0 = 1.0. That's 4× better than sigmoid's 0.25 at the peak. But tanh still saturates: at x = 3, tanh(3) ≈ 0.995, giving tanh'(3) = 1 - 0.995² ≈ 0.010. At x = 5, tanh'(5) ≈ 0.00018. The gradient is essentially dead.
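The tanh facts above are easy to check numerically. This is a minimal sketch, with illustrative names, that tests the identity tanh(x) = 2σ(2x) - 1 and evaluates the tanh gradient at a few inputs.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2


xs = np.array([0.0, 1.0, 3.0, 5.0])

# Largest deviation between tanh(x) and 2*sigmoid(2x) - 1 on these inputs
identity_gap = np.max(np.abs(np.tanh(xs) - (2.0 * sigmoid(2.0 * xs) - 1.0)))
grads = tanh_grad(xs)

print("identity gap:", identity_gap)
for x, g in zip(xs, grads):
    print(f"x = {x:.0f}: tanh'(x) = {g:.5f}")
```

The gradient falls by roughly two orders of magnitude between x = 0 and x = 3, which is the saturation behavior the chain-of-gradients argument relies on.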
During backpropagation, the gradient at each layer is multiplied by the local activation gradient. In a 10-layer network with sigmoid activations, the gradient arriving at layer 1 is the product of 10 sigmoid derivatives.
Even at the maximum (0.25 per layer):

$$0.25^{10} \approx 9.5 \times 10^{-7}$$
That's less than one millionth of the original gradient. Layer 1's weights essentially stop learning. This is the vanishing gradient problem, and it's why networks deeper than ~5 layers were nearly impossible to train before 2012.
| Layers (n) | Sigmoid ($0.25^{n}$) | Tanh best ($1.0^{n}$) | Tanh typical ($0.6^{n}$) |
|---|---|---|---|
| 3 | 0.0156 | 1.0 | 0.216 |
| 5 | 0.00098 | 1.0 | 0.0778 |
| 10 | $9.5 \times 10^{-7}$ | 1.0 | 0.0060 |
| 20 | $9.1 \times 10^{-13}$ | 1.0 | $3.7 \times 10^{-5}$ |
The "tanh best" column is the theoretical maximum — all inputs at exactly zero. In practice, inputs wander away from zero, and tanh gradients shrink to ~0.6 or less. The "typical" column shows the realistic picture. Tanh is better than sigmoid, but still vanishes.
Toggle between sigmoid and tanh. Drag the input slider to see the gradient at each point. Below: watch the gradient chain shrink as layers increase.
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)  # max = 0.25 at x=0

def tanh_grad(x):
    t = np.tanh(x)
    return 1 - t**2  # max = 1.0 at x=0

# Chain of 10 sigmoid gradients at x=0 (best case)
chain_10 = sigmoid_grad(0) ** 10
print(f"10-layer sigmoid chain: {chain_10:.2e}")  # 9.54e-07

# Chain of 10 tanh gradients at x=1, where tanh'(1) ≈ 0.42
chain_10_tanh = tanh_grad(1) ** 10
print(f"10-layer tanh chain (x=1): {chain_10_tanh:.2e}")  # ≈ 1.7e-04
```
In 2012, a simple insight changed everything. What if the activation function was just... a ramp? No exponentials, no saturation, no squashing. For positive inputs, pass them through unchanged. For negative inputs, output zero. That's ReLU — the Rectified Linear Unit — and it made deep learning possible.
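$$\text{ReLU}(x) = \max(0, x)$$

That's the entire definition. One line. No parameters, no exponentials, no division. Just a comparison and a zero. It's about the simplest nonlinear function you can imagine, and it went a long way toward solving the vanishing gradient problem that had plagued neural networks for decades.

The gradient of ReLU is:

$$\text{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}$$

At exactly x = 0 the derivative is undefined; implementations pick a value by convention, usually 0.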
For positive inputs, the gradient is exactly 1. Not 0.25 like sigmoid. Not 0.6 like typical tanh. Exactly 1. This means gradients flow through active ReLU neurons without any shrinkage.
In a 100-layer network where all neurons are active (positive inputs), the gradient chain is:

$$1^{100} = 1$$

Compare with sigmoid at its best case:

$$0.25^{100} \approx 6.2 \times 10^{-61}$$

That's not a typo: $6 \times 10^{-61}$. ReLU gives you a gradient factor of 1; sigmoid gives you a factor with sixty zeros after the decimal point, and somewhat deeper networks reach $10^{-80}$, the reciprocal of the roughly $10^{80}$ atoms in the observable universe. This is a large part of why AlexNet (2012), the model that launched the deep learning era, used ReLU, and why most architectures after it followed suit.
| x | ReLU(x) | ReLU'(x) | σ(x) | σ'(x) |
|---|---|---|---|---|
| -2 | 0 | 0 | 0.119 | 0.105 |
| -1 | 0 | 0 | 0.269 | 0.197 |
| 0 | 0 | 0 or 1 | 0.500 | 0.250 |
| 0.5 | 0.5 | 1 | 0.622 | 0.235 |
| 1 | 1 | 1 | 0.731 | 0.197 |
| 3 | 3 | 1 | 0.953 | 0.045 |
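Now chain 10 of these gradients together. Assume all neurons are active (x > 0):

$$\text{ReLU: } 1^{10} = 1$$

$$\text{Sigmoid, best case } (x = 0)\text{: } 0.25^{10} \approx 9.5 \times 10^{-7}$$

$$\text{Sigmoid at } x = 1\text{: } 0.1966^{10} \approx 8.6 \times 10^{-8}$$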
Six to eight orders of magnitude difference. This is why ReLU enabled deep networks. Gradients that actually reach the early layers mean early layers actually learn.
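The per-layer arithmetic above assumes every neuron sits at one fixed input. As a rough sanity check on a real forward and backward pass, the sketch below backpropagates by hand through a 10-layer dense network and reports the size of the first layer's weight gradient. The width, depth, weight scale, seed, and the sum-of-outputs loss are all arbitrary assumptions made for the demo.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def relu(z):
    return np.maximum(0.0, z)


def relu_grad(z):
    return (z > 0).astype(float)


def first_layer_grad_norm(act, act_grad, depth=10, width=32, seed=0):
    """Backpropagate through `depth` dense layers; return the norm of layer 1's weight gradient."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
               for _ in range(depth)]
    h = rng.normal(size=(64, width))

    # Forward pass, caching each layer's input and pre-activation
    cache = []
    for W in weights:
        z = h @ W
        cache.append((h, z))
        h = act(z)

    # Backward pass for loss = sum of the final activations
    delta = np.ones_like(h)
    grad_W = None
    for W, (layer_in, z) in zip(reversed(weights), reversed(cache)):
        delta = delta * act_grad(z)  # multiply by the local activation gradient
        grad_W = layer_in.T @ delta  # weight gradient for this layer
        delta = delta @ W.T          # pass the gradient to the layer below
    return float(np.linalg.norm(grad_W))  # the last one computed belongs to layer 1


sig_norm = first_layer_grad_norm(sigmoid, sigmoid_grad)
relu_norm = first_layer_grad_norm(relu, relu_grad)
print(f"sigmoid: first-layer gradient norm = {sig_norm:.3e}")
print(f"ReLU:    first-layer gradient norm = {relu_norm:.3e}")
```

Both networks share the same weights and inputs, so the gap in gradient norm comes from the activation alone. The exact numbers depend on the initialization chosen here.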
ReLU's gradient for x ≤ 0 is exactly zero. If a neuron's pre-activation becomes negative for every input in the training data (say, due to a large negative bias or an unlucky weight update), it outputs zero everywhere, its gradient is zero, its weights never update, and it stays at zero. The neuron is effectively dead.
This isn't rare. In practice, 10-40% of neurons in a ReLU network can die during training. The risk factors:
Think of it this way: sigmoid neurons get "sleepy" (vanishing gradients slow learning), but ReLU neurons can "die" (zero gradient means zero learning, forever). Sleepy neurons can eventually wake up if the gradient signal is strong enough. Dead neurons cannot.
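One way to make "dead" concrete is to count neurons whose pre-activation is non-positive on every example in a dataset. The sketch below does that for a hypothetical layer. The layer sizes, the random data, and the deliberately sabotaged biases are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer: 8 inputs, 16 ReLU neurons, 1000 training examples
X = rng.normal(size=(1000, 8))
W = rng.normal(size=(8, 16))
b = np.zeros(16)
b[:4] = -50.0  # assumption for the demo: a bad update pushed 4 biases far negative

pre_activation = X @ W + b

# A neuron is dead on this data if its pre-activation is <= 0 for every example,
# because ReLU's gradient is then zero for every example as well
dead = (pre_activation <= 0).all(axis=0)
dead_fraction = dead.mean()

print("dead neurons:", np.flatnonzero(dead))
print(f"dead fraction: {dead_fraction:.2f}")
```

The same check can be run on a real network by collecting pre-activations over a validation batch, though a neuron that is silent on one batch may still fire on other data.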
Top: ReLU and sigmoid curves with gradient overlay. Bottom: a grid of 64 neurons trained step-by-step. With ReLU, watch neurons die (go dark). With sigmoid, all survive but gradients fade. Adjust learning rate to see the effect.
```python
import numpy as np

def relu(x):
    return np.maximum(0, x)  # one line: the simplest activation

def relu_grad(x):
    # 1 if positive, 0 if not; np.asarray lets plain Python floats work too
    return (np.asarray(x) > 0).astype(float)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Compare gradient chains at the same input
x = 1.0
relu_chain_10 = float(relu_grad(x)) ** 10
sig_chain_10 = (sigmoid(x) * (1 - sigmoid(x))) ** 10
print(f"ReLU 10-layer chain:    {relu_chain_10:.1f}")  # 1.0
print(f"Sigmoid 10-layer chain: {sig_chain_10:.2e}")  # ≈ 8.6e-08
```
Dead neurons waste capacity. If 30% of your network's neurons are dead, you're paying for a 30% bigger network than you're actually using. Compute, memory, parameters — all wasted on neurons that output zero forever and will never learn again.
The fix is elegantly simple: instead of outputting exactly zero for negative inputs, let a small signal through. A tiny leak. That's Leaky ReLU.
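$$\text{LeakyReLU}(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases}$$

Where α is a small positive constant, typically 0.01. For positive inputs, it behaves exactly like ReLU: gradient of 1, no saturation. For negative inputs, instead of a flat zero, the output is a gently sloping line with slope α.

The gradient:

$$\text{LeakyReLU}'(x) = \begin{cases} 1 & x > 0 \\ \alpha & x \le 0 \end{cases}$$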
For negative inputs, the gradient is α = 0.01. Tiny, but nonzero. Dead neurons become "drowsy" neurons — they can still receive gradient signal and eventually wake up. A neuron that got pushed into negative territory by a bad weight update can slowly recover, because 0.01 of the gradient still flows through.
Exponential Linear Unit (ELU) takes a different approach to the negative region. Instead of a straight line, it uses an exponential curve:

$$\text{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha\,(e^{x} - 1) & x \le 0 \end{cases}$$

For large negative x, $e^{x}$ approaches 0, so ELU approaches -α. The function has a smooth asymptote at -α for the negative side, while being identical to ReLU for positive inputs.

The gradient:

$$\text{ELU}'(x) = \begin{cases} 1 & x > 0 \\ \alpha\,e^{x} & x \le 0 \end{cases}$$

Key difference from Leaky ReLU: the ELU gradient for very negative inputs approaches 0 (not α). At x = -1, the gradient is $\alpha \cdot e^{-1} \approx 0.368\alpha$. At x = -5, it's $\alpha \cdot e^{-5} \approx 0.0067\alpha$. This provides soft saturation: extremely negative inputs get suppressed, acting as a noise filter, while moderately negative inputs still pass gradient.
| x | ReLU | ReLU' | Leaky (α = 0.01) | Leaky' | ELU (α = 1) | ELU' |
|---|---|---|---|---|---|---|
| -3 | 0 | 0 | -0.03 | 0.01 | -0.950 | 0.050 |
| -1 | 0 | 0 | -0.01 | 0.01 | -0.632 | 0.368 |
| 0 | 0 | 0 | 0 | 0.01 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 3 | 1 | 3 | 1 | 3 | 1 |
Let's verify the ELU values at x = -3 step by step (with α = 1):

$$e^{-3} \approx 0.0498$$

$$\text{ELU}(-3) = 1 \cdot (0.0498 - 1) \approx -0.950$$

$$\text{ELU}'(-3) = 1 \cdot 0.0498 \approx 0.050$$

Compare that gradient of 0.050 with ReLU's 0. The ELU neuron at x = -3 is still learning: slowly, but it hasn't died. And at x = -1, ELU's gradient is 0.368, which is quite healthy. Leaky ReLU's gradient, by contrast, is a constant 0.01 regardless of how negative the input gets. ELU gives stronger gradients near zero and weaker ones far away.
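The gradient columns can be reproduced in a few lines. This sketch uses the same defaults as the table (α = 0.01 for Leaky ReLU, α = 1 for ELU); the function names are illustrative. It also computes the input at which ELU's gradient drops below Leaky ReLU's constant slope.

```python
import numpy as np


def leaky_relu_grad(x, alpha=0.01):
    """1 for positive inputs, the constant slope alpha otherwise."""
    return np.where(x > 0, 1.0, alpha)


def elu_grad(x, alpha=1.0):
    """1 for positive inputs, alpha * e^x otherwise (decays toward 0)."""
    return np.where(x > 0, 1.0, alpha * np.exp(x))


xs = np.array([-5.0, -3.0, -1.0, 1.0])
leaky_grads = leaky_relu_grad(xs)
elu_grads = elu_grad(xs)

# ELU's gradient equals Leaky ReLU's 0.01 where e^x = 0.01
crossover = np.log(0.01)

for x, lg, eg in zip(xs, leaky_grads, elu_grads):
    print(f"x = {x:4.1f}: Leaky' = {lg:.4f}, ELU' = {eg:.4f}")
print(f"ELU' falls below 0.01 for x < {crossover:.2f}")
```

So ELU passes more gradient than Leaky ReLU for moderately negative inputs and less for very negative ones, with the crossover around x ≈ -4.6 under these settings.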
Parametric ReLU (PReLU), proposed by He et al. (2015), is Leaky ReLU where α is a learnable parameter. Instead of fixing α = 0.01, the network decides how much negative signal to let through during training.
In their ImageNet experiments, He et al. found that the learned α values varied by layer. Early layers learned larger α (more negative signal preserved), while later layers learned smaller α. The network was automatically tuning the activation function per layer — something hand-tuning could never achieve efficiently.
PReLU added only one parameter per channel (or per layer), so the overhead is negligible. It improved ImageNet top-5 error by ~0.5% over standard ReLU — small in absolute terms, but significant at the frontier of accuracy at the time.
All three functions on the same axes. Adjust α to see how the negative region changes. Below: gradient heatmaps showing gradient strength across the input range. Further below: the dead neuron counter from Ch2 — compare ReLU vs. Leaky ReLU vs. ELU.
```python
import numpy as np
import torch
import torch.nn as nn

# From scratch
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

# PyTorch equivalents
leaky = nn.LeakyReLU(negative_slope=0.01)
elu_layer = nn.ELU(alpha=1.0)
prelu = nn.PReLU(num_parameters=1)  # alpha is learnable!

x = torch.linspace(-3, 3, 7)
print("Leaky:", leaky(x))
print("ELU:  ", elu_layer(x))
print("PReLU:", prelu(x).detach())
print("PReLU alpha:", prelu.weight.item())  # initial: 0.25
```
ReLU makes a binary decision: positive inputs pass, negative inputs die. But what if instead of a hard gate, we used a soft one? What if the probability of passing an input through depended on how large it is?
Inputs of +3 almost certainly pass. Inputs of -3 almost certainly get zeroed. Inputs near 0 get a coin flip. That probabilistic interpretation is GELU — the Gaussian Error Linear Unit.
Proposed by Hendrycks and Gimpel in 2016, GELU became the default activation for BERT, GPT-2, GPT-3, and Vision Transformer (ViT). Many transformers still use GELU, or a gated variant of it, in their feed-forward blocks.
Here's where GELU comes from. Imagine you have an input x to a neuron. Instead of passing it through directly, you multiply it by a random mask — a Bernoulli random variable that's either 0 or 1. If the mask is 1, the input passes. If 0, it's dropped. Sound familiar? That's dropout.
But dropout uses a fixed probability (like 0.1). What if the dropout probability depended on the value of x itself? Specifically: the probability that the input passes is Φ(x), the standard normal CDF — the probability that a draw from a standard normal distribution is less than or equal to x.
Large positive x? Almost all of the normal distribution is below you, so Φ(x) ≈ 1 — you pass. Large negative x? Almost none is below you, so Φ(x) ≈ 0 — you're dropped. Near zero? Φ(0) = 0.5 — a coin flip.
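Taking the expected value over that random mask gives a deterministic function:

$$\text{GELU}(x) = x \cdot \Phi(x) = \frac{x}{2}\left[1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]$$

Where erf is the Gauss error function. Since erf can be expensive to compute, there's a fast approximation used in practice:

$$\text{GELU}(x) \approx \frac{x}{2}\left[1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\,\bigl(x + 0.044715\,x^{3}\bigr)\right)\right]$$

The tanh approximation is what PyTorch computes when you pass approximate='tanh'; the default uses erf directly. Both produce nearly identical results: the absolute difference stays well below 0.001.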
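The "expected dropout" reading of GELU can be tested by simulation: draw a Bernoulli mask that passes the input with probability Φ(x), average x times the mask over many draws, and compare against x · Φ(x). The sketch below does this. The sample count, seed, and helper names are arbitrary choices.

```python
import math
import numpy as np


def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x):
    return x * phi(x)


def stochastic_gate_mean(x, n_samples=200_000, seed=0):
    """Average of x * mask, where mask ~ Bernoulli(phi(x))."""
    rng = np.random.default_rng(seed)
    mask = rng.random(n_samples) < phi(x)
    return float(np.mean(x * mask))


for x in (-1.0, 0.5, 2.0):
    print(f"x = {x:4.1f}: simulated = {stochastic_gate_mean(x):.4f}, "
          f"x * Phi(x) = {gelu(x):.4f}")
```

The simulated averages should match x · Φ(x) to within sampling noise, which is all the deterministic GELU formula claims to be: the mean of the stochastic gate.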
Let's compute GELU by hand at five inputs. We need Φ(x) — the standard normal CDF — which you can look up in a Z-table or compute from erf.
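x = -2: Φ(-2) ≈ 0.0228, so GELU(-2) = -2 × 0.0228 ≈ -0.046

x = -1: Φ(-1) ≈ 0.1587, so GELU(-1) = -1 × 0.1587 ≈ -0.159

x = 0: Φ(0) = 0.5, so GELU(0) = 0 × 0.5 = 0

x = 1: Φ(1) ≈ 0.8413, so GELU(1) = 1 × 0.8413 ≈ 0.841

x = 2: Φ(2) ≈ 0.9772, so GELU(2) = 2 × 0.9772 ≈ 1.954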
Three properties made GELU the transformer default:
1. Smooth everywhere. GELU has continuous derivatives of all orders. ReLU has a discontinuous first derivative at x=0. This smoothness means gradients never have sudden jumps, which helps optimizers like Adam maintain stable momentum estimates.
2. Non-monotonic near zero. GELU dips below zero for negative inputs, reaching a minimum of about -0.17 near x ≈ -0.75. This means the function isn't monotonic: moving left from zero it first decreases to that minimum, then rises back toward zero. This non-monotonicity is sometimes credited with a mild regularizing effect, though that is hard to isolate.
3. Stochastic regularization. Because GELU can be interpreted as the expected value of an input-dependent dropout mask, it acts as a deterministic stand-in for that kind of stochastic regularization. Empirically, Hendrycks and Gimpel reported GELU matching or beating ReLU and ELU across their vision, language, and speech experiments.
Drag the slider to move a probe along the x-axis. Top: GELU vs ReLU curves. Middle: the gate probability Φ(x). Bottom: gradients compared.
```python
import torch
import math

# Exact GELU using the error function
def gelu_exact(x):
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

# Fast tanh approximation (PyTorch's approximate='tanh' option)
def gelu_tanh(x):
    return 0.5 * x * (1.0 + torch.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    ))

# Verify against PyTorch built-in (the default is the exact erf form)
x = torch.linspace(-3, 3, 7)
print("Exact:  ", gelu_exact(x))
print("Approx: ", gelu_tanh(x))
print("PyTorch:", torch.nn.functional.gelu(x))
print("Max |exact - approx|:", (gelu_exact(x) - gelu_tanh(x)).abs().max().item())
```
In 2017, Google Brain tried something unusual. Instead of designing an activation function by hand, they used neural architecture search — an AI designing AI components. They searched over a space of simple mathematical operations and evaluated each candidate on real tasks.
The winner was remarkably simple: multiply the input by its own sigmoid. They called it Swish.
$$\text{Swish}(x) = x \cdot \sigma(x)$$

That's it. The sigmoid σ(x) acts as a gate: for large positive x, σ(x) ≈ 1 so Swish(x) ≈ x (identity). For large negative x, σ(x) ≈ 0 so Swish(x) ≈ 0 (suppression). Near zero, the sigmoid gives a smooth interpolation.
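The full Swish formula includes a learnable (or fixed) parameter β:

$$\text{Swish}_{\beta}(x) = x \cdot \sigma(\beta x)$$

This parameter controls the sharpness of the gate. With β = 0 the gate is a constant 0.5 and Swish reduces to the linear function x/2; β = 1 gives the standard Swish (SiLU); as β grows, the gate sharpens toward a hard step and Swish approaches ReLU.

The Swish gradient is more complex than ReLU's but has a critical advantage: it vanishes only at a single point (the function's minimum), never across an entire region of inputs the way ReLU's does.

$$\text{Swish}'(x) = \sigma(x) + x\,\sigma(x)\,\bigl(1 - \sigma(x)\bigr) = \sigma(x)\,\bigl(1 + x\,(1 - \sigma(x))\bigr)$$

At x = 0: σ(0) = 0.5, so Swish'(0) = 0.5 · (1 + 0) = 0.5. Compare this to ReLU, whose gradient jumps from 0 to 1 at x=0. Swish has no discontinuity: its gradient changes smoothly through every input value.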
Even for large negative x (like x = -10), σ(-10) ≈ 0.0000454, which is tiny but nonzero. Swish neurons never fully die — there's always a small gradient signal to nudge them back to life.
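A standard way to check a hand-derived gradient is to compare it against central finite differences. The sketch below does that for the β = 1 case, using the closed form σ(x)(1 + x(1 - σ(x))). The step size and input grid are arbitrary choices.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def silu(x):
    return x * sigmoid(x)


def silu_grad(x):
    """Closed-form derivative of x * sigmoid(x)."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))


xs = np.linspace(-10.0, 10.0, 201)
h = 1e-5

# Central finite difference: (f(x + h) - f(x - h)) / (2h)
numeric_grad = (silu(xs + h) - silu(xs - h)) / (2.0 * h)
max_gap = float(np.max(np.abs(numeric_grad - silu_grad(xs))))

grad_at_zero = float(silu_grad(0.0))
grad_at_minus_10 = float(silu_grad(-10.0))

print(f"max |numeric - formula| = {max_gap:.2e}")
print(f"Swish'(0)   = {grad_at_zero:.4f}")
print(f"Swish'(-10) = {grad_at_minus_10:.6f}")
```

The gradient at x = -10 is tiny and negative rather than exactly zero, so a neuron stuck far in the negative region still receives a small update signal.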
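Using β = 1 (standard SiLU). We need $\sigma(x) = 1/(1 + e^{-x})$.

x = -2: σ(-2) ≈ 0.1192, so Swish(-2) = -2 × 0.1192 ≈ -0.238

x = -1: σ(-1) ≈ 0.2689, so Swish(-1) = -1 × 0.2689 ≈ -0.269

x = 0: σ(0) = 0.5, so Swish(0) = 0 × 0.5 = 0

x = 1: σ(1) ≈ 0.7311, so Swish(1) = 1 × 0.7311 ≈ 0.731

x = 2: σ(2) ≈ 0.8808, so Swish(2) = 2 × 0.8808 ≈ 1.762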
The minimum of Swish occurs at x ≈ -1.28, where Swish(x) ≈ -0.278. This is a deeper dip than GELU's minimum of -0.17. Both are non-monotonic, but Swish allows larger negative outputs.
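The location and depth of each dip can be found with a simple grid search over the negative axis. This is a rough numerical sketch rather than an analytical derivation; the grid range, resolution, and helper names are assumptions.

```python
import math
import numpy as np


def silu(x):
    return x / (1.0 + np.exp(-x))


def gelu(x):
    erf = np.vectorize(math.erf)  # math.erf is scalar-only, so vectorize it
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))


xs = np.linspace(-5.0, 0.0, 50_001)  # grid step of 1e-4


def argmin_on_grid(f):
    """Return (x, f(x)) at the smallest function value on the grid."""
    ys = f(xs)
    i = int(np.argmin(ys))
    return float(xs[i]), float(ys[i])


silu_min = argmin_on_grid(silu)
gelu_min = argmin_on_grid(gelu)

print(f"SiLU minimum: x = {silu_min[0]:.3f}, value = {silu_min[1]:.4f}")
print(f"GELU minimum: x = {gelu_min[0]:.3f}, value = {gelu_min[1]:.4f}")
```

The printed values give the depth of each dip directly, so the claim that Swish allows larger negative outputs than GELU can be read straight off the output.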
SiLU (Sigmoid Linear Unit) is exactly Swish with β = 1. The name was proposed independently by Elfwing et al. in 2018. PyTorch exposes it as torch.nn.SiLU(). You'll see both names; they're the same function.
Where SiLU/Swish appears in the wild:
| Model | Where | Year |
|---|---|---|
| EfficientNet | All conv layers | 2019 |
| LLaMA / LLaMA 2 | FFN gate (via SwiGLU) | 2023 |
| Mistral / Mixtral | FFN gate (via SwiGLU) | 2023 |
| Stable Diffusion | U-Net conv blocks | 2022 |
| Gemma | FFN gate (via GeGLU, which swaps SiLU for GELU) | 2024 |
Slide β from 0 (linear) to 5 (near-ReLU). Watch the curve morph. Below: gradient comparison with ReLU and GELU.
```python
import torch

# SiLU / Swish (beta=1)
def silu(x):
    return x * torch.sigmoid(x)

# General Swish with adjustable beta
def swish(x, beta=1.0):
    return x * torch.sigmoid(beta * x)

# Gradient of SiLU (for understanding)
def silu_grad(x):
    s = torch.sigmoid(x)
    return s * (1 + x * (1 - s))

# Verify against PyTorch built-in
x = torch.linspace(-3, 3, 7)
print("Ours:   ", silu(x))
print("PyTorch:", torch.nn.functional.silu(x))
# Identical: SiLU IS Swish with beta=1

# Beta sweep: watch Swish morph from linear to ReLU
for beta in [0, 0.5, 1, 2, 5, 20]:
    y = swish(torch.tensor(1.0), beta)
    print(f"beta={beta:4} Swish(1)={y.item():.4f}")
# beta=0 → 0.5, beta=1 → 0.731, beta=20 → 1.000 (≈ReLU)
```
Every transformer has two main components per layer: attention and a feed-forward network (FFN). We've been talking about what activation goes inside the FFN. But what if the FFN's entire structure changed?
What if instead of one path through an activation, you had two paths — one for content and one for deciding what to keep? That's the Gated Linear Unit (GLU).
In the original transformer (Vaswani et al., 2017), each layer's FFN is:

$$\text{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$
Two weight matrices. One activation. One path. Simple. The activation (ReLU or GELU) decides which features to suppress. But it makes that decision based only on the magnitude of each value independently.
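A Gated Linear Unit splits the FFN into two parallel projections:

$$\text{FFN}_{\text{GLU}}(x) = \bigl(\text{act}(xW_{\text{gate}}) \odot xW_{\text{up}}\bigr)\,W_{\text{down}}$$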
The key insight: the gate path and the content path are different linear projections of the same input. The gate learns which features to keep. The content learns what values to produce. The element-wise product lets the network learn feature selection — a much richer operation than applying an activation function element-wise.
The activation function on the gate path defines the GLU variant:
| Variant | Gate Activation | Formula | Used In |
|---|---|---|---|
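| GLU | σ(x) (sigmoid) | $\sigma(xW_{\text{gate}}) \odot xW_{\text{up}}$ | Original (Dauphin 2017) |
| SwiGLU | SiLU/Swish | $\text{SiLU}(xW_{\text{gate}}) \odot xW_{\text{up}}$ | LLaMA, Mistral, PaLM |
| GeGLU | GELU | $\text{GELU}(xW_{\text{gate}}) \odot xW_{\text{up}}$ | Gemma, some T5 variants |
| ReGLU | ReLU | $\text{ReLU}(xW_{\text{gate}}) \odot xW_{\text{up}}$ | Experimental |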
SwiGLU (Shazeer, 2020) emerged as the practical winner. In Shazeer's language-modeling experiments, the gated variants, with SwiGLU and GeGLU in front, outperformed the non-gated ReLU and GELU FFNs at matched parameter counts. Google adopted SwiGLU for PaLM. Meta adopted it for LLaMA. Most major open-weight LLMs now use SwiGLU or a close relative such as GeGLU.
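You might worry that GLU adds parameters. After all, it has three weight matrices ($W_{\text{gate}}$, $W_{\text{up}}$, $W_{\text{down}}$) instead of two ($W_1$, $W_2$). Let's count, ignoring biases.

Standard FFN: $W_1$ is d_model × d_ff and $W_2$ is d_ff × d_model, for a total of 2 × d_model × d_ff parameters.

GLU FFN with hidden dimension d': $W_{\text{gate}}$ and $W_{\text{up}}$ are each d_model × d', and $W_{\text{down}}$ is d' × d_model, for a total of 3 × d_model × d' parameters.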
To match parameter counts: 3 × d_model × d' = 2 × d_model × d_ff. Solve: d' = 2/3 × d_ff.
Let's trace a 4D input through both a standard FFN and a SwiGLU FFN. d_model = 4.
Input: x = [0.5, -1.0, 0.3, 0.8]
Standard FFN (d_ff = 8, so W1 is 4×8):
Suppose after W1, the 8D hidden vector is: h = [1.2, -0.8, 0.5, -1.5, 2.1, 0.3, -0.1, 0.9]
After ReLU: [1.2, 0, 0.5, 0, 2.1, 0.3, 0, 0.9]. Three features killed outright — the ReLU decided independently for each.
SwiGLU FFN (d' = 5 to match params — since 3×4×5 = 60 ≈ 2×4×8 = 64):
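Gate path ($W_{\text{gate}}$ · x): [0.8, -1.2, 0.4, -0.3, 1.5]

Content path ($W_{\text{up}}$ · x): [1.1, 0.7, -0.9, 0.5, 0.2]

Apply SiLU to gate path, using SiLU(g) = g · σ(g):

[0.8 × 0.690, -1.2 × 0.231, 0.4 × 0.599, -0.3 × 0.426, 1.5 × 0.818] ≈ [0.552, -0.278, 0.239, -0.128, 1.226]

Element-wise multiply (gate ⊙ content), carrying unrounded intermediate values:

[0.552 × 1.1, -0.278 × 0.7, 0.239 × (-0.9), -0.128 × 0.5, 1.226 × 0.2]

Hidden after gating: [0.607, -0.194, -0.216, -0.064, 0.245]. Then $W_{\text{down}}$ projects back to d_model = 4.

Notice: no features were zeroed outright. The gate modulated each value based on learned context, rather than making a binary pass/kill decision. The feature with gate 1.5 has a large gate but small content (0.2), so its output stays small. The feature with gate -1.2 gets a negative gate value that flips the sign of its content. This is a richer operation than ReLU's element-wise thresholding.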
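The gating arithmetic in this trace can be replayed with NumPy. The sketch below starts from the gate and content vectors given above (the outputs of the two projections, which the text supplies directly) and also repeats the parameter-count comparison for d_model = 4. Variable names are illustrative.

```python
import numpy as np


def silu(x):
    return x / (1.0 + np.exp(-x))


# Projections from the worked example (taken as given, not computed from weights)
gate_path = np.array([0.8, -1.2, 0.4, -0.3, 1.5])
content_path = np.array([1.1, 0.7, -0.9, 0.5, 0.2])

gate = silu(gate_path)        # smooth gate values, can be slightly negative
hidden = gate * content_path  # element-wise gating

# Parameter counts without biases, for d_model = 4
d_model, d_ff, d_glu = 4, 8, 5
standard_params = 2 * d_model * d_ff
glu_params = 3 * d_model * d_glu

print("SiLU(gate):", np.round(gate, 3))
print("hidden:    ", np.round(hidden, 3))
print("params (standard, GLU):", standard_params, glu_params)
```

In a full layer, `hidden` would then be multiplied by the down-projection matrix to return to d_model dimensions.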
Toggle between architectures to see data flow. Click "Compute!" to animate a sample input through the network. Adjust dimensions to see how parameter counts change.
```python
import torch
import torch.nn as nn

class StandardFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        # 2/3 rule: hidden_dim = 2/3 * d_ff
        hidden = int(2 * d_ff / 3)
        hidden = hidden + (256 - hidden % 256) % 256  # round up to a multiple of 256
        self.w_gate = nn.Linear(d_model, hidden, bias=False)
        self.w_up = nn.Linear(d_model, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x):
        gate = nn.functional.silu(self.w_gate(x))  # SiLU = Swish
        up = self.w_up(x)
        return self.w_down(gate * up)  # element-wise gating

# Parameter count comparison
d, ff = 4096, 4 * 4096
std = StandardFFN(d, ff)
glu = SwiGLUFFN(d, ff)
std_p = sum(p.numel() for p in std.parameters())
glu_p = sum(p.numel() for p in glu.parameters())
print(f"Standard FFN: {std_p:,} params")  # 134,217,728
print(f"SwiGLU FFN:   {glu_p:,} params")  # 135,266,304 (close!)
```
Notice a pattern? GELU multiplies the input by its normal CDF. Swish multiplies the input by its sigmoid. Most modern smooth activations follow the same template: x times some smooth gate function. Mish continues this pattern, and understanding the template is more valuable than memorizing individual formulas.

These activations can all be written as:

$$f(x) = x \cdot g(x)$$
Where g: ℝ → [0, 1] is a smooth function that approaches 1 for large positive x and 0 for large negative x. The x· prefix ensures the function behaves like the identity for large positive inputs. The gate ensures suppression for large negative inputs. Different gate functions give different activations:
| Activation | Gate g(x) | Formula |
|---|---|---|
| SiLU/Swish | $\sigma(x) = 1/(1+e^{-x})$ | x · σ(x) |
| GELU | Φ(x) = 0.5(1+erf(x/√2)) | x · Φ(x) |
| Mish | tanh(softplus(x)) | $x \cdot \tanh(\ln(1+e^{x}))$ |
All three gates have the same shape: a smooth S-curve from 0 to 1. They differ in exactly where they transition and how quickly, but the overall behavior is nearly identical.
Proposed by Diganta Misra in 2019, Mish uses a gate built from two familiar pieces: softplus and tanh.

$$\text{Mish}(x) = x \cdot \tanh\bigl(\text{softplus}(x)\bigr) = x \cdot \tanh\bigl(\ln(1 + e^{x})\bigr)$$

Let's unpack this from the inside out:

softplus(x) = $\ln(1 + e^{x})$ is a smooth approximation of ReLU. For large positive x, softplus(x) ≈ x. For large negative x, softplus(x) ≈ 0. At x=0, softplus(0) = ln(2) ≈ 0.693.

tanh squashes its input to the range (-1, 1). Since softplus is always positive, tanh(softplus(x)) always lies in (0, 1), which makes it a valid gate. For large positive x: softplus ≈ x, tanh(x) ≈ 1, so the gate ≈ 1. For large negative x: softplus ≈ 0, tanh(0) = 0, so the gate ≈ 0. Exactly the behavior we need.
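The inside-out construction translates directly into code. This is a minimal NumPy sketch; using `np.logaddexp` for softplus is a choice made here to avoid overflow for large inputs, not something the original formula requires.

```python
import numpy as np


def softplus(x):
    # np.logaddexp(0, x) = ln(e^0 + e^x) = ln(1 + e^x), without overflow for large x
    return np.logaddexp(0.0, x)


def mish_gate(x):
    """The Mish gate: tanh(softplus(x)), always strictly between 0 and 1."""
    return np.tanh(softplus(x))


def mish(x):
    return x * mish_gate(x)


xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
gates = mish_gate(xs)
values = mish(xs)

for x, g, v in zip(xs, gates, values):
    print(f"x = {x:4.1f}: gate = {g:.4f}, Mish(x) = {v:.4f}")
```

At x = 0 the gate is tanh(ln 2), which works out to exactly 0.6, so near zero Mish passes slightly more than half of its input.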
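x = -2: softplus(-2) = ln(1 + 0.1353) ≈ 0.1269, tanh(0.1269) ≈ 0.1263, so Mish(-2) = -2 × 0.1263 ≈ -0.253

x = -1: softplus(-1) = ln(1 + 0.3679) ≈ 0.3133, tanh(0.3133) ≈ 0.3034, so Mish(-1) ≈ -0.303

x = 0: softplus(0) = ln(2) ≈ 0.6931, tanh(0.6931) = 0.6, so Mish(0) = 0 × 0.6 = 0

x = 1: softplus(1) = ln(1 + 2.7183) ≈ 1.3133, tanh(1.3133) ≈ 0.8651, so Mish(1) ≈ 0.865

x = 2: softplus(2) = ln(1 + 7.3891) ≈ 2.1269, tanh(2.1269) ≈ 0.9720, so Mish(2) = 2 × 0.9720 ≈ 1.944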
Compare the three activations at the same inputs:
| x | GELU | SiLU | Mish | ReLU |
|---|---|---|---|---|
| -2 | -0.046 | -0.238 | -0.253 | 0 |
| -1 | -0.159 | -0.269 | -0.303 | 0 |
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0.841 | 0.731 | 0.865 | 1 |
| 2 | 1.954 | 1.762 | 1.944 | 2 |
The differences are modest: a few tenths at most over this range. For large positive inputs, all three converge toward ReLU. Mish dips slightly below SiLU for negative values and tracks GELU more closely for positive ones.
Mish gained adoption primarily in computer vision:
In NLP/LLMs, Mish never gained traction — GELU and SiLU were already established. But Mish proved that the x·gate(x) template is robust: you can swap in different gates and get similar performance.
Toggle each activation on/off. Top: function curves. Bottom: difference from ReLU — the deviations are tiny and concentrated near zero.
```python
import torch
import torch.nn.functional as F
import math

# The x * gate(x) template
def gated_activation(x, gate_fn):
    """Smooth gated activations: f(x) = x * gate(x)"""
    return x * gate_fn(x)

# Different gates
def sigmoid_gate(x):
    return torch.sigmoid(x)  # → SiLU

def phi_gate(x):
    return 0.5 * (1 + torch.erf(x / math.sqrt(2)))  # → GELU

def mish_gate(x):
    return torch.tanh(F.softplus(x))  # → Mish

# All three from the same template
x = torch.linspace(-3, 3, 100)
silu_out = gated_activation(x, sigmoid_gate)
gelu_out = gated_activation(x, phi_gate)
mish_out = gated_activation(x, mish_gate)

# How different are they really?
print("Max |GELU - SiLU|:", (gelu_out - silu_out).abs().max().item())  # roughly 0.2
print("Max |GELU - Mish|:", (gelu_out - mish_out).abs().max().item())  # roughly 0.2
print("Max |SiLU - Mish|:", (silu_out - mish_out).abs().max().item())  # roughly 0.2
# Modest differences on [-3, 3]: the three curves share one overall shape
```
We've studied each activation function in isolation. Now let's put them all on the same axes and watch them compete. The simulation below plots every function we've covered — with its gradient — so you can see at a glance how they differ in the regions that matter most: near zero, deep negative, and far positive.
The key tradeoffs become visible immediately. Sigmoid and tanh saturate on both sides. ReLU is dead on the left but perfectly linear on the right. GELU and SiLU gently curve through zero, allowing a small negative region. Mish is the smoothest of all. And SwiGLU isn't shown directly because it's a gated mechanism applied to two streams — you saw it in Chapter 6.
All functions on the same axes. Toggle each one on/off. Top panel: function values. Bottom panel: gradients (derivatives). Drag the x-marker to read exact values.
Look at the bottom panel (gradients) with all functions enabled. The picture tells the entire story of activation function evolution:
The trend is clear: each generation produced smoother, more gradient-friendly activations. The field converged on functions that are smooth at zero, have gradient near 1 for positive inputs, and allow a small controlled negative signal.
Ten chapters, eight activation functions, one gating mechanism. Here's everything compressed into a single reference table, followed by the decision tree for choosing the right activation and links to where these ideas lead next.
| Name | Formula | Gradient | Key Property | Used In |
|---|---|---|---|---|
| Sigmoid | $1/(1+e^{-x})$ | σ(1-σ), max 0.25 | Saturates both sides | Output gates, LSTM gates |
| Tanh | $(e^{x}-e^{-x})/(e^{x}+e^{-x})$ | $1 - \tanh^{2}(x)$, max 1.0 | Zero-centered, saturates | LSTM state, older RNNs |
| ReLU | max(0, x) | 1 if x>0, else 0 | Dead neurons, but simple | CNNs, default choice pre-2018 |
| Leaky ReLU | x if x>0, αx if x≤0 | 1 if x>0, α if x≤0 | No dead neurons | GANs, when ReLU dies |
| ELU | x if x>0, $\alpha(e^{x}-1)$ if x≤0 | 1 if x>0, $\alpha e^{x}$ if x≤0 | Smooth, pushes mean toward 0 | Niche use, research |
| GELU | x · Φ(x) | Φ(x) + x · φ(x) | Smooth, probabilistic gate | BERT, GPT-2, ViT |
| SiLU/Swish | x · σ(x) | σ(x)(1 + x(1-σ(x))) | Non-monotonic, gradient can slightly exceed 1 | EfficientNet, Stable Diffusion |
| Mish | x · tanh(softplus(x)) | Complex (see Ch 7) | Smoothest of all | YOLOv4, niche use |
| SwiGLU | SiLU(xW) ⊙ (xV) | Gated: gradient depends on both streams | Learned gating; the third weight matrix is offset by the 2/3 hidden-size rule | LLaMA, PaLM, Mistral |
Loss Functions — activations shape the forward pass; loss functions shape what the network learns. Loss Functions covers MSE, cross-entropy, contrastive losses, and when to use each.
Normalization — BatchNorm, LayerNorm, and RMSNorm work hand-in-hand with activations. They re-center inputs before the activation, preventing the saturation that killed sigmoid networks. Normalization derives each technique from scratch.
Optimizers — Adam, AdamW, and learning rate schedules determine how the gradients (shaped by activations) become weight updates. Optimizers covers the full landscape.
Transformer — the architecture where SwiGLU lives. The feed-forward network in every transformer block uses an activation function. The Transformer lesson shows the complete architecture.
Backpropagation — we talked about gradient chains through activations. Backpropagation shows the full chain rule through every layer type, not just activations.