EE269 Lecture 28 — Mert Pilanci, Stanford

Convex Neural Networks

The loss landscape of a two-layer ReLU network has infinitely many solutions — but there's a convex reformulation hiding inside.

Prerequisites: Neural networks (forward pass, ReLU) + Basic optimization (gradient descent). That's it.
8
Chapters
7+
Simulations
0
Assumed Knowledge

Chapter 0: Why Multiple Solutions?

Train a neural network twice on the same data with different random initializations. You'll get different weights both times — but both networks may achieve zero training loss. How can two completely different sets of parameters produce the same predictions?

In linear regression, there's either a unique solution or an infinite family (when the system is underdetermined). For neural networks, the situation is far more complex: the loss function is non-convex, with a complicated landscape of minima, saddle points, and plateaus. Multiple global minima exist, and gradient descent (GD) finds different ones depending on initialization.

This raises deep questions. Which solution does GD find? Is one solution "better" than another for generalization? Can we characterize all solutions? Professor Pilanci's research answers these questions for a fundamental case: the two-layer ReLU network.

The big revelation: For two-layer ReLU networks, the non-convex training problem can be exactly reformulated as a convex optimization problem. This means we can characterize the entire set of optimal solutions, understand which one GD selects, and explain phenomena like double descent. Convex optimization theory — well-understood for 60+ years — suddenly applies to neural networks.
Many Solutions, One Loss

Each dot is a different set of network weights found by GD from different initializations. All achieve zero training loss (orange surface = loss landscape slice). Click Run GD to watch new trajectories converge to different minima.

Why can two neural networks with completely different weights make the same predictions?

Chapter 1: Two-Layer ReLU Networks

Let's formalize the network we'll study throughout this lesson. A two-layer ReLU network with m hidden neurons computes:

f(x) = ∑j=1m wj(2) · (xT wj(1))+

where (z)+ = max(0, z) is the ReLU activation. Let's break down every piece:

SymbolShapeRole
xd × 1Input vector (d features)
wj(1)d × 1First-layer weight for neuron j
(xTwj(1))+scalarReLU of the linear projection
wj(2)scalarSecond-layer weight for neuron j
mNumber of hidden neurons (width)

Each hidden neuron computes a ReLU half-plane: it "activates" (outputs a positive value) for inputs on one side of a hyperplane defined by wj(1), and is silent (outputs zero) for the other side. The second layer combines these half-planes with signed weights.

Think of each neuron as a "hinge." In 1D, ReLU(xTw) creates a piecewise-linear function that's zero on one side and linear on the other — like a hinge. With enough hinges placed at different angles, you can approximate any function. The first layer decides where to place the hinges; the second layer decides how to combine them.

The Training Problem

Given n training examples {(xi, yi)}, we minimize:

minW(1), W(2) (1/n) ∑i=1n (f(xi) − yi)2

This is a non-convex optimization because of the ReLU and the product wj(2) · (xTwj(1))+. Standard convex optimization theory doesn't directly apply. Or does it?

Two-Layer ReLU in Action

A 2-neuron network fitting 1D data. Each neuron contributes a "hinge" (shown separately). Their sum is the network output. Drag the slider to change neuron orientations.

Neuron 1 angle 1.0
Neuron 2 angle -1.5
In a two-layer ReLU network f(x) = ∑ wj(2)(xTwj(1))+, what makes the training problem non-convex?

Chapter 2: The Non-Convex Landscape

To see why the loss surface is non-convex, consider a simple symmetry. If you negate a first-layer weight wj(1) and negate the corresponding second-layer weight wj(2), the ReLU pattern changes, but you can often find compensating changes in other neurons. Different weight configurations map to the same function.

Sources of Non-Convexity

1. Rescaling symmetry: Multiply wj(1) by α > 0 and divide wj(2) by α. The function f(x) is unchanged because (α · z)+ = α · (z)+ for α > 0.

2. Permutation symmetry: Relabeling the neurons (swapping neuron 1 and neuron 2) gives different weight vectors but the same network function.

3. Dead neurons: Any neuron with wj(2) = 0 contributes nothing, regardless of wj(1). This creates flat valleys in the loss landscape.

Non-convex doesn't mean non-solvable. Despite these symmetries and non-convexities, gradient descent reliably finds good solutions in practice. This was a mystery for decades: why does GD work on non-convex problems? Pilanci's convex reformulation explains it — under the surface, the optimization has convex structure.

What the Landscape Looks Like

For overparameterized networks (more neurons than needed to fit the data), the loss landscape typically has:

Loss Landscape Cross-Section

A 2D slice through the high-dimensional loss surface. Bright = high loss, dark = low loss (zero). The "valley" of zero-loss solutions is visible. Click to place GD starting points and watch them converge to different spots in the valley.

For overparameterized two-layer ReLU networks, the set of global minima (zero training loss) is typically:

Chapter 3: GD Finds Min-Norm

Among the infinitely many solutions that achieve zero training loss, gradient descent has a preference. It doesn't pick just any solution — it systematically finds the one with minimum norm.

The Implicit Bias of Gradient Descent

For linear models, this is a theorem: GD initialized at zero converges to the minimum L2-norm solution. For neural networks, the story is more nuanced but the principle holds.

Consider the overparameterized regime where many solutions achieve zero loss. GD initialized near zero moves through parameter space along a path that minimizes the distance traveled. The result is the solution closest to the origin (in parameter space).

GD converges to: argmin ||W||2   subject to   fW(xi) = yi   ∀i
Think of GD as a "lazy" optimizer. It starts at the origin (or near it) and takes the shortest path to a zero-loss solution. Among all possible solutions, it picks the "simplest" one in the sense of having the smallest weights. This is an implicit regularization — GD acts like weight decay without explicitly adding a penalty.

Why Min-Norm Matters for Generalization

Smaller norm weights produce smoother functions (less extreme outputs). This is why overparameterized networks can generalize despite having more parameters than data points: GD's implicit bias toward min-norm solutions acts as regularization. The network could memorize random noise (it has the capacity), but GD "prefers" the simpler, smoother fit.

From Norm to Convex Program

This observation is the bridge to the convex reformulation: if GD finds the min-norm solution, and we can write the min-norm problem as a convex optimization, then we've characterized GD's output exactly using convex theory.

GD Trajectory to Min-Norm Solution

The gray region shows all zero-loss solutions (feasible set). The orange dot is GD's starting point (near origin). Watch GD converge to the closest point on the feasible set — the minimum-norm solution. Try different starting points.

Learning rate 0.10
Among all solutions that achieve zero training loss, gradient descent (initialized near zero) tends to find:

Chapter 4: Convex Reformulation

This is the key theoretical result. Pilanci and Ergen (2020) showed that the non-convex training problem for a two-layer ReLU network can be exactly reformulated as a convex optimization — specifically, a group LASSO problem.

The Key Idea: Enumerate ReLU Activation Patterns

A ReLU neuron with weight w(1) partitions the input space into two halves: where xTw(1) ≥ 0 (neuron active) and where xTw(1) < 0 (neuron silent). For n training points, each neuron has a specific activation pattern: which data points activate it.

There are at most 2n possible activation patterns (each data point is either on or off). Denote the set of all feasible patterns as P = {D1, D2, ..., DP}, where Dk is a diagonal 0/1 matrix indicating which data points are active for pattern k.

Dk = diag(dk1, ..., dkn),    dki ∈ {0, 1}

The Convex Program

The non-convex min-norm problem:

min ∑j (||wj(1)||2 + |wj(2)|)   s.t.   ∑j wj(2) Dj X wj(1) = y

can be reformulated as:

min ∑k=1P ||uk||2   s.t.   ∑k=1P Dk X uk = y

where uk ∈ Rd are the new convex variables. This is a convex problem (minimizing a sum of norms subject to linear constraints)!

What just happened? We replaced the optimization over neuron weights (non-convex because of ReLU and product structure) with an optimization over activation-pattern-weighted vectors (convex). Each uk represents "how much and in what direction does a neuron with pattern k contribute." The non-convexity was absorbed into the enumeration of patterns.

Why This Works

The trick exploits the positive homogeneity of ReLU: (αz)+ = α(z)+ for α > 0. This means we can "absorb" the magnitude of w(1) into w(2). Once we fix the activation pattern (which data points are on/off), the problem becomes linear in the remaining parameters. Summing over all possible patterns gives the full convex program.

Activation Patterns

5 data points (orange dots) in 2D. Each gray line is a possible neuron orientation — data points above the line are "active" (filled), below are "silent" (hollow). Each line defines a different activation pattern. Click to add data points.

The convex reformulation works by:

Chapter 5: Group LASSO Connection

The convex reformulation min ∑||uk||2 subject to linear constraints has a name in statistics: it's a group LASSO problem. This connection to a well-studied framework gives us powerful analytical tools.

What Is Group LASSO?

Standard LASSO minimizes ||y − Xβ||2 + λ||β||1, promoting sparsity (many coefficients exactly zero). Group LASSO promotes group sparsity: entire groups of variables are either all zero or all nonzero.

minu (1/2) ||y − ∑k Ak uk||2 + λ ∑k ||uk||2

Here each "group" uk corresponds to one activation pattern. The group LASSO penalty encourages most groups to be zero — meaning most activation patterns are unused. Only a few patterns get nonzero weight. This is exactly sparsity in the number of effective neurons!

Sparsity in patterns = few neurons. The group LASSO naturally selects a small number of active patterns (nonzero uk). Each active pattern corresponds to one neuron in the final network. So the convex formulation automatically determines the network width! You don't need to choose the number of neurons — the optimization does it for you.

The Regularization Path

As λ varies from large to small:

λEffectNetwork interpretation
LargeHeavy penalty, very few active groupsNetwork with 1-2 neurons (underfitting)
MediumBalanced, moderate number of active groupsWell-regularized network
SmallWeak penalty, many active groupsWide network (potential overfitting)
0No penalty, min-norm solutionWhat GD finds (implicit regularization)
Group LASSO Regularization Path

Each colored line is one "group" (activation pattern). As λ decreases, more groups activate. The number of nonzero groups = the effective network width. Adjust λ to see the sparsity pattern.

log(λ) 1.0
The group LASSO connection tells us that the convex reformulation naturally promotes:

Chapter 6: Double Descent

Classical statistical wisdom says: more parameters = more overfitting. The bias-variance tradeoff predicts that test error decreases as model capacity grows (reducing bias), then increases past a certain point (increasing variance). But neural networks violate this: after the traditional peak, test error decreases again as we add even more parameters.

This is double descent: the test error curve has two descents, separated by a peak at the interpolation threshold (where the number of parameters equals the number of data points).

The Three Regimes

RegimeParameters vs DataBehavior
Under-parameterizedp < nClassical: more params helps (reduces bias)
Interpolation thresholdp ≈ nPeak: the model barely fits the data, very sensitive to noise
Over-parameterizedp >> nModern: more params helps again (GD finds simpler/min-norm solution)
The convex framework explains double descent. At the interpolation threshold (p = n), the system of equations has exactly one solution — and that one solution is forced to fit both signal and noise (high test error). Beyond the threshold (p > n), infinitely many solutions exist, and GD picks the min-norm one — which is smoother and generalizes better. The convex reformulation makes this rigorous: the group LASSO regularization path shows exactly how over-parameterization enables implicit regularization.

Why Over-Parameterization Helps

With more neurons than data points:

The more overparameterized we are, the "more room" GD has to find a smooth solution. This is the second descent.

Double Descent Curve

Test error (orange) and training error (teal) as network width increases. The vertical dashed line marks the interpolation threshold (width = n). Adjust the noise level to see how it amplifies the peak.

Noise level 0.5
Why does test error decrease again in the overparameterized regime (second descent)?

Chapter 7: Loss Landscape Showcase

The grand finale. Below you can explore the loss landscape of a two-layer ReLU network interactively. Watch GD trajectories from different initializations converge to different solutions in the zero-loss valley — all arriving at the min-norm solution cluster.

What you'll see: The heatmap shows the loss landscape (dark = low loss). The blue contour marks the zero-loss valley. Each orange trajectory is a GD run from a random initialization. They all converge to the same region of the valley: the min-norm point closest to the origin. This is the implicit bias in action.
Interactive Loss Landscape

Click anywhere to place a GD starting point. Watch it descend to the zero-loss valley and settle at the min-norm solution. Multiple runs converge to the same spot despite starting from different locations.

Learning Rate 0.050
GD Steps 100
Solution Norm Comparison

Bar chart comparing the L2 norm of different solutions found by GD. Despite landing at different points in the loss landscape, they all have similar (minimal) norm — confirming the implicit bias.

Network Functions

The actual function f(x) computed by each GD solution. Despite having different weights, they produce nearly identical functions (the min-norm interpolant). Training data shown as dots.