The loss landscape of a two-layer ReLU network has infinitely many solutions — but there's a convex reformulation hiding inside.
Train a neural network twice on the same data with different random initializations. You'll get different weights both times — but both networks may achieve zero training loss. How can two completely different sets of parameters produce the same predictions?
In linear regression, there's either a unique solution or an infinite family (when the system is underdetermined). For neural networks, the situation is far more complex: the loss function is non-convex, with a complicated landscape of minima, saddle points, and plateaus. Multiple global minima exist, and gradient descent (GD) finds different ones depending on initialization.
This raises deep questions. Which solution does GD find? Is one solution "better" than another for generalization? Can we characterize all solutions? Professor Pilanci's research answers these questions for a fundamental case: the two-layer ReLU network.
Each dot is a different set of network weights found by GD from different initializations. All achieve zero training loss (orange surface = loss landscape slice). Click Run GD to watch new trajectories converge to different minima.
Let's formalize the network we'll study throughout this lesson. A two-layer ReLU network with m hidden neurons computes:
where (z)+ = max(0, z) is the ReLU activation. Let's break down every piece:
| Symbol | Shape | Role |
|---|---|---|
| x | d × 1 | Input vector (d features) |
| wj(1) | d × 1 | First-layer weight for neuron j |
| (xTwj(1))+ | scalar | ReLU of the linear projection |
| wj(2) | scalar | Second-layer weight for neuron j |
| m | — | Number of hidden neurons (width) |
Each hidden neuron computes a ReLU half-plane: it "activates" (outputs a positive value) for inputs on one side of a hyperplane defined by wj(1), and is silent (outputs zero) for the other side. The second layer combines these half-planes with signed weights.
Given n training examples {(xi, yi)}, we minimize:
This is a non-convex optimization because of the ReLU and the product wj(2) · (xTwj(1))+. Standard convex optimization theory doesn't directly apply. Or does it?
A 2-neuron network fitting 1D data. Each neuron contributes a "hinge" (shown separately). Their sum is the network output. Drag the slider to change neuron orientations.
To see why the loss surface is non-convex, consider a simple symmetry. If you negate a first-layer weight wj(1) and negate the corresponding second-layer weight wj(2), the ReLU pattern changes, but you can often find compensating changes in other neurons. Different weight configurations map to the same function.
1. Rescaling symmetry: Multiply wj(1) by α > 0 and divide wj(2) by α. The function f(x) is unchanged because (α · z)+ = α · (z)+ for α > 0.
2. Permutation symmetry: Relabeling the neurons (swapping neuron 1 and neuron 2) gives different weight vectors but the same network function.
3. Dead neurons: Any neuron with wj(2) = 0 contributes nothing, regardless of wj(1). This creates flat valleys in the loss landscape.
For overparameterized networks (more neurons than needed to fit the data), the loss landscape typically has:
A 2D slice through the high-dimensional loss surface. Bright = high loss, dark = low loss (zero). The "valley" of zero-loss solutions is visible. Click to place GD starting points and watch them converge to different spots in the valley.
Among the infinitely many solutions that achieve zero training loss, gradient descent has a preference. It doesn't pick just any solution — it systematically finds the one with minimum norm.
For linear models, this is a theorem: GD initialized at zero converges to the minimum L2-norm solution. For neural networks, the story is more nuanced but the principle holds.
Consider the overparameterized regime where many solutions achieve zero loss. GD initialized near zero moves through parameter space along a path that minimizes the distance traveled. The result is the solution closest to the origin (in parameter space).
Smaller norm weights produce smoother functions (less extreme outputs). This is why overparameterized networks can generalize despite having more parameters than data points: GD's implicit bias toward min-norm solutions acts as regularization. The network could memorize random noise (it has the capacity), but GD "prefers" the simpler, smoother fit.
This observation is the bridge to the convex reformulation: if GD finds the min-norm solution, and we can write the min-norm problem as a convex optimization, then we've characterized GD's output exactly using convex theory.
The gray region shows all zero-loss solutions (feasible set). The orange dot is GD's starting point (near origin). Watch GD converge to the closest point on the feasible set — the minimum-norm solution. Try different starting points.
This is the key theoretical result. Pilanci and Ergen (2020) showed that the non-convex training problem for a two-layer ReLU network can be exactly reformulated as a convex optimization — specifically, a group LASSO problem.
A ReLU neuron with weight w(1) partitions the input space into two halves: where xTw(1) ≥ 0 (neuron active) and where xTw(1) < 0 (neuron silent). For n training points, each neuron has a specific activation pattern: which data points activate it.
There are at most 2n possible activation patterns (each data point is either on or off). Denote the set of all feasible patterns as P = {D1, D2, ..., DP}, where Dk is a diagonal 0/1 matrix indicating which data points are active for pattern k.
The non-convex min-norm problem:
can be reformulated as:
where uk ∈ Rd are the new convex variables. This is a convex problem (minimizing a sum of norms subject to linear constraints)!
The trick exploits the positive homogeneity of ReLU: (αz)+ = α(z)+ for α > 0. This means we can "absorb" the magnitude of w(1) into w(2). Once we fix the activation pattern (which data points are on/off), the problem becomes linear in the remaining parameters. Summing over all possible patterns gives the full convex program.
5 data points (orange dots) in 2D. Each gray line is a possible neuron orientation — data points above the line are "active" (filled), below are "silent" (hollow). Each line defines a different activation pattern. Click to add data points.
The convex reformulation min ∑||uk||2 subject to linear constraints has a name in statistics: it's a group LASSO problem. This connection to a well-studied framework gives us powerful analytical tools.
Standard LASSO minimizes ||y − Xβ||2 + λ||β||1, promoting sparsity (many coefficients exactly zero). Group LASSO promotes group sparsity: entire groups of variables are either all zero or all nonzero.
Here each "group" uk corresponds to one activation pattern. The group LASSO penalty encourages most groups to be zero — meaning most activation patterns are unused. Only a few patterns get nonzero weight. This is exactly sparsity in the number of effective neurons!
As λ varies from large to small:
| λ | Effect | Network interpretation |
|---|---|---|
| Large | Heavy penalty, very few active groups | Network with 1-2 neurons (underfitting) |
| Medium | Balanced, moderate number of active groups | Well-regularized network |
| Small | Weak penalty, many active groups | Wide network (potential overfitting) |
| 0 | No penalty, min-norm solution | What GD finds (implicit regularization) |
Each colored line is one "group" (activation pattern). As λ decreases, more groups activate. The number of nonzero groups = the effective network width. Adjust λ to see the sparsity pattern.
Classical statistical wisdom says: more parameters = more overfitting. The bias-variance tradeoff predicts that test error decreases as model capacity grows (reducing bias), then increases past a certain point (increasing variance). But neural networks violate this: after the traditional peak, test error decreases again as we add even more parameters.
This is double descent: the test error curve has two descents, separated by a peak at the interpolation threshold (where the number of parameters equals the number of data points).
| Regime | Parameters vs Data | Behavior |
|---|---|---|
| Under-parameterized | p < n | Classical: more params helps (reduces bias) |
| Interpolation threshold | p ≈ n | Peak: the model barely fits the data, very sensitive to noise |
| Over-parameterized | p >> n | Modern: more params helps again (GD finds simpler/min-norm solution) |
With more neurons than data points:
The more overparameterized we are, the "more room" GD has to find a smooth solution. This is the second descent.
Test error (orange) and training error (teal) as network width increases. The vertical dashed line marks the interpolation threshold (width = n). Adjust the noise level to see how it amplifies the peak.
The grand finale. Below you can explore the loss landscape of a two-layer ReLU network interactively. Watch GD trajectories from different initializations converge to different solutions in the zero-loss valley — all arriving at the min-norm solution cluster.
Click anywhere to place a GD starting point. Watch it descend to the zero-loss valley and settle at the min-norm solution. Multiple runs converge to the same spot despite starting from different locations.
Bar chart comparing the L2 norm of different solutions found by GD. Despite landing at different points in the loss landscape, they all have similar (minimal) norm — confirming the implicit bias.
The actual function f(x) computed by each GD solution. Despite having different weights, they produce nearly identical functions (the min-norm interpolant). Training data shown as dots.