Same accuracy, 20× less compute. The difference is architecture design.
Two image classifiers. Both achieve 76% accuracy on ImageNet. Model A uses 10 billion FLOPs per image. Model B uses 500 million FLOPs. Same answer, 20× less compute. Model B runs on your phone in real time. Model A needs a data center GPU.
The difference? Architecture design. Not training tricks, not better data, not bigger GPUs. Pure structural decisions about how information flows through the network.
Here's the fundamental equation of computational cost:
A badly designed layer can waste 90% of its FLOPs on redundant computation — multiplying numbers that contribute almost nothing to the final answer. A well-designed layer computes only what matters.
Let's make this concrete. A standard 3×3 convolution with 256 input channels and 256 output channels on a 56×56 feature map costs:
That's almost 2 billion multiply-adds for a single layer. A typical network has 50+ such layers. The question isn't "can we afford this?" — it's "do we need all of it?"
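As a quick check of that number, here is a minimal sketch that counts multiply-adds for a standard convolution. It assumes stride 1 and "same" padding, so the output map has the same size as the input; the helper name `conv_macs` is illustrative, not from any library.

```python
def conv_macs(k, c_in, c_out, h, w):
    """Multiply-adds for a standard KxK convolution (stride 1, 'same' padding assumed)."""
    # Each of the h * w * c_out outputs is a dot product over a k * k * c_in patch.
    return k * k * c_in * c_out * h * w

layer = conv_macs(k=3, c_in=256, c_out=256, h=56, w=56)
print(f"{layer:,} multiply-adds")  # 1,849,688,064 multiply-adds
```

Multiply by 50 layers and you are near 100 billion multiply-adds per image, which is why the per-layer cost matters so much.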
Adjust kernel size, channels, and spatial resolution to see how FLOPs explode for standard convolutions. The teal bar shows depthwise separable cost for comparison.
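Notice how the standard convolution bar grows quadratically with channels when Cin and Cout increase together (K² × Cin × Cout), and quadratically with kernel size. The depthwise separable alternative costs K² × Cin + Cin × Cout, so the K² factor multiplies only a term that is linear in channels. This gap is the opportunity that efficient architectures exploit.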
One of the most important efficiency tricks for convolutional networks is the depthwise separable convolution. It splits one expensive operation into two cheap ones, usually at only a small cost in accuracy.
Let's understand what a standard convolution actually computes. At each spatial position, a K×K×Cin input patch is multiplied elementwise with a filter of the same size and summed (a dot product), producing one output value. You need Cout such filters to produce all output channels.
The key insight: spatial mixing and channel mixing don't have to happen at the same time. We can separate them.
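Step 1, depthwise: apply a separate K×K filter to each input channel independently. No cross-channel interaction. Each channel gets its own spatial filter. Cost: K² × Cin × H × W multiply-adds.
Step 2, pointwise: apply 1×1 convolutions to mix channels. This is just a matrix multiply at each spatial position, a linear projection from Cin dimensions to Cout dimensions. Cost: Cin × Cout × H × W multiply-adds.
Savings ratio: K² × Cin × Cout / (K² × Cin + Cin × Cout) = K² × Cout / (K² + Cout). For K=3, Cout=256: savings = 9×256 / (9+256) = 2304/265 ≈ 8.7×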
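To see how the savings change with channel count, here is a small sketch that counts multiply-adds for both versions. The function names are illustrative, and the counts assume stride 1, "same" padding, and no bias terms.

```python
def standard_macs(k, c_in, c_out, h, w):
    # One k*k*c_in dot product per output value.
    return k * k * c_in * c_out * h * w

def separable_macs(k, c_in, c_out, h, w):
    depthwise = k * k * c_in * h * w   # one KxK filter per input channel
    pointwise = c_in * c_out * h * w   # 1x1 conv that mixes channels
    return depthwise + pointwise

def savings_ratio(k, c_in, c_out, h=56, w=56):
    return standard_macs(k, c_in, c_out, h, w) / separable_macs(k, c_in, c_out, h, w)

for c in (64, 256, 1024):
    print(c, round(savings_ratio(3, c, c), 2))
# 64 7.89
# 256 8.69
# 1024 8.92
```

The ratio does not depend on the spatial size or on Cin, and it approaches K² = 9 as Cout grows.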
```python
import torch.nn as nn

# Standard convolution: mixes spatial + channels together
standard = nn.Conv2d(256, 256, kernel_size=3, padding=1)
# Weights (ignoring biases): 3*3*256*256 = 589,824

# Depthwise separable: two steps
depthwise = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=256)
pointwise = nn.Conv2d(256, 256, kernel_size=1)
# Weights (ignoring biases): 3*3*256 + 256*256 = 2,304 + 65,536 = 67,840
# That's 8.7x fewer parameters too!

def depthwise_separable(x):
    x = depthwise(x)   # spatial filtering per channel
    x = pointwise(x)   # channel mixing
    return x
```
The standard conv is one monolithic block. The separable version splits into two small blocks. Adjust channels to see how the ratio changes.
In 2017, Google published MobileNetV1: stack depthwise separable convolutions, add batch normalization and ReLU after each step, and you get a network that needs roughly 27× fewer multiply-adds than VGG-16 at nearly the same ImageNet accuracy (70.6% vs 71.5%). Simple and beautiful.
But V1 had a problem. The depthwise step operates on each channel independently — it can't learn cross-channel features at that stage. If you have a narrow bottleneck (few channels), the depthwise filter is starved of information. You need enough channels for the spatial filter to work with.
The solution is counterintuitive. A standard residual block (ResNet) goes wide→narrow→wide: compress channels with 1×1, do the expensive 3×3 in the narrow space, then expand back. MobileNetV2 does the opposite.
Why "inverted"? In a ResNet bottleneck block the shortcut connects the wide ends and skips over the narrow middle. Here the shortcut connects the narrow bottlenecks and skips over the wide inner representation. The data lives in the skinny state; the fat state is temporary, just for spatial filtering.
The expansion factor t (typically 6) controls how wide the internal representation gets. With t=6 and a 24-channel bottleneck, the depthwise step works on 144 channels — plenty of capacity for rich spatial features.
Input: 24 channels, 56×56 spatial, expansion t=6.
Compare to a standard 3×3 conv with 144 input and 144 output channels at 56×56:
The inverted residual is about 23× cheaper (roughly 25.7M versus 585M multiply-adds), while still doing its spatial filtering in a 144-channel space.
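The arithmetic behind this comparison can be reproduced with a short sketch. It counts only the convolution multiply-adds for a stride-1 block and ignores batch norm and activations; the function name is illustrative.

```python
def inverted_residual_macs(c_in, c_out, h, w, t=6, k=3):
    """Multiply-adds for the three convolutions of a stride-1 inverted residual block."""
    c_mid = c_in * t
    expand = c_in * c_mid * h * w        # 1x1 expansion: 24 -> 144 channels
    depthwise = k * k * c_mid * h * w    # 3x3 depthwise on the wide tensor
    project = c_mid * c_out * h * w      # 1x1 linear projection: 144 -> 24 channels
    return expand, depthwise, project

expand, depthwise, project = inverted_residual_macs(24, 24, 56, 56)
block_total = expand + depthwise + project
standard = 3 * 3 * 144 * 144 * 56 * 56   # standard 3x3 conv, 144 -> 144 channels

print(expand, depthwise, project)        # 10838016 4064256 10838016
print(block_total, standard)             # 25740288 585252864
print(round(standard / block_total, 1))  # 22.7
```

Note that the two 1×1 convolutions dominate the block; the depthwise step is the cheapest of the three.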
```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, c_in, c_out, stride, expand_ratio):
        super().__init__()
        c_mid = c_in * expand_ratio
        # The shortcut is only valid when input and output shapes match
        self.use_residual = (stride == 1 and c_in == c_out)
        layers = []
        if expand_ratio != 1:
            # Expand: 1x1 conv to widen channels
            layers += [nn.Conv2d(c_in, c_mid, 1),
                       nn.BatchNorm2d(c_mid), nn.ReLU6()]
        # Depthwise: 3x3 spatial filtering
        layers += [nn.Conv2d(c_mid, c_mid, 3, stride, padding=1, groups=c_mid),
                   nn.BatchNorm2d(c_mid), nn.ReLU6()]
        # Project: 1x1 conv to narrow channels (NO activation!)
        layers += [nn.Conv2d(c_mid, c_out, 1), nn.BatchNorm2d(c_out)]
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_residual:
            return x + self.conv(x)
        return self.conv(x)
```
Left: standard residual (wide→narrow→wide). Right: inverted residual (narrow→wide→narrow). The residual connection joins the wide ends on the left and the narrow ends on the right.
You have a baseline network that works. Now you want to make it bigger for higher accuracy. You have three knobs to turn: make it deeper (more layers), wider (more channels per layer), or increase resolution (bigger input images). Which do you turn?
Most people pick one. ResNet scales depth (18→34→50→101→152). WideResNet scales width. But the EfficientNet paper (Tan & Le, 2019) discovered something profound: scaling all three together, in a specific ratio, is dramatically better than scaling any one dimension alone.
Why those exponents? FLOPs scale as d × w² × r² (depth is linear, width and resolution are quadratic in their effect on compute). The constraint α·β²·γ² ≈ 2 ensures that each unit increase in φ roughly doubles the total FLOPs.
Check: 1.2 × 1.1² × 1.15² = 1.2 × 1.21 × 1.3225 = 1.92 ≈ 2 ✓
EfficientNet-B0 baseline: 18 layers, 32 base width, 224×224 input.
For B3, φ = 3:
The result: EfficientNet-B3 achieves 81.6% ImageNet accuracy with only 1.8B FLOPs — compared to ResNet-152 at 78.3% accuracy with 11.6B FLOPs. Higher accuracy, 6.4× less compute.
| Model | Top-1 Acc | FLOPs | φ |
|---|---|---|---|
| EfficientNet-B0 | 77.1% | 0.39B | 0 |
| EfficientNet-B1 | 79.1% | 0.70B | 1 |
| EfficientNet-B2 | 80.1% | 1.0B | 2 |
| EfficientNet-B3 | 81.6% | 1.8B | 3 |
| EfficientNet-B4 | 82.9% | 4.2B | 4 |
| EfficientNet-B7 | 84.3% | 37B | 7 |
| ResNet-152 | 78.3% | 11.6B | — |
Drag the φ slider to scale the network. Watch depth, width, and resolution grow together. The cube's volume represents total FLOPs.
```python
import math

# EfficientNet compound scaling
alpha = 1.2   # depth coefficient
beta = 1.1    # width coefficient
gamma = 1.15  # resolution coefficient

def scale_model(phi, base_depth=18, base_width=32, base_res=224):
    d = math.ceil(base_depth * alpha**phi)
    w = math.ceil(base_width * beta**phi)
    r = math.ceil(base_res * gamma**phi)
    flop_mult = 2**phi  # approximately
    return {'depth': d, 'width': w, 'resolution': r, 'flop_mult': flop_mult}

# EfficientNet-B3 under the idealized formula
# (the released B3 uses hand-rounded values, e.g. a 300x300 input)
b3 = scale_model(phi=3)
print(b3)
# {'depth': 32, 'width': 43, 'resolution': 341, 'flop_mult': 8}
```
What if instead of a human designing the architecture, we let an algorithm search for one? That's Neural Architecture Search (NAS) — automated discovery of optimal network structures.
The setup: define a search space (what operations are allowed? how many layers? which connections?), a search strategy (how do we explore?), and an evaluation method (how do we score each candidate?).
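To make the three ingredients concrete, here is a toy sketch of the simplest possible search strategy, random search. Everything in it is an assumption for illustration: the operation list, the cost and gain tables, and the `evaluate` proxy are made up, and a real system would train and validate each candidate instead.

```python
import random

# Search space (hypothetical): one operation per layer, six layers deep.
OPS = ["conv3x3", "conv5x5", "depthwise3x3", "maxpool3x3", "skip"]
NUM_LAYERS = 6

# Made-up numbers for illustration only: relative cost and usefulness per op.
TOY_COST = {"conv3x3": 9.0, "conv5x5": 25.0, "depthwise3x3": 1.0, "maxpool3x3": 0.5, "skip": 0.0}
TOY_GAIN = {"conv3x3": 3.0, "conv5x5": 4.0, "depthwise3x3": 2.0, "maxpool3x3": 0.5, "skip": 0.0}

def sample_architecture(rng):
    """Search strategy: uniform random sampling from the search space."""
    return [rng.choice(OPS) for _ in range(NUM_LAYERS)]

def evaluate(arch, cost_budget=40.0):
    """Evaluation: a toy proxy score standing in for 'train, then measure accuracy'."""
    cost = sum(TOY_COST[op] for op in arch)
    gain = sum(TOY_GAIN[op] for op in arch)
    return gain - max(0.0, cost - cost_budget)  # penalize going over the budget

def random_search(num_trials=200, seed=0):
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture(rng)
        score = evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

best_arch, best_score = random_search()
print(best_arch, best_score)
```

Smarter strategies (reinforcement learning, evolution, gradient-based relaxation) replace `sample_architecture`, and cheaper evaluation methods replace `evaluate`; the overall loop keeps the same shape.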
A typical cell-based search space (NASNet) defines:
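Naive NAS (Zoph & Le, 2017): train each candidate architecture to convergence and evaluate thousands of them. Even at an optimistic 1 GPU-hour per evaluation, that's thousands of GPU-hours; the original experiments trained over ten thousand candidates on hundreds of GPUs, which is impractical for most teams.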
The breakthrough: train a single supernet that contains ALL possible architectures as sub-networks. Each candidate is a subset of the supernet's weights. To evaluate a candidate, just mask the supernet to that subset — no retraining needed.
```python
import torch.nn as nn

# Simplified one-shot NAS supernet
class SupernetCell(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),   # op 0: 3x3 conv
            nn.Conv2d(channels, channels, 5, padding=2),   # op 1: 5x5 conv
            nn.Conv2d(channels, channels, 3, padding=1,    # op 2: depthwise
                      groups=channels),
            nn.MaxPool2d(3, stride=1, padding=1),          # op 3: max pool
            nn.Identity(),                                 # op 4: skip
        ])

    def forward(self, x, arch_choice):
        # arch_choice is an integer index selecting which operation to use
        return self.ops[arch_choice](x)

# During search: sample arch_choice randomly per step
# After search: fix arch_choice to best found architecture
```
Watch architectures being proposed, evaluated, and selected. Green = high accuracy, red = low. The population evolves toward better designs.
You've designed an efficient architecture. You've trained it in PyTorch or TensorFlow on a beefy GPU server. Now you need to run it on a phone, a drone, or an embedded sensor. The gap between "works in my notebook" and "runs at 30fps on an iPhone" is enormous.
The problem: different hardware requires different representations. Your GPU training framework stores the model as a Python object graph with floating-point weights. A mobile chip needs a compact binary with quantized weights and fused operations.
| Format | Target | Key Feature |
|---|---|---|
| ONNX | Interchange | Framework-agnostic graph format |
| TFLite | Android/embedded | Quantization-aware, small binary |
| CoreML | Apple devices | Neural Engine acceleration |
| TensorRT | NVIDIA GPUs | Kernel fusion, FP16/INT8 |
| NNAPI | Android NPUs | Hardware abstraction layer |
Before deploying, the compiler performs transformations that don't change the math but dramatically speed up execution:
The biggest single deployment optimization. Convert FP32 weights and activations to INT8:
This gives 4× memory reduction, 2-4× speedup on INT8-capable hardware, and typically less than 1% accuracy loss with proper calibration.
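Here is a minimal sketch of what the FP32 to INT8 mapping looks like, assuming the common affine (scale plus zero-point) scheme applied to a whole tensor at once. The function names are illustrative; real toolchains add calibration, per-channel scales, and fused integer kernels on top of this idea.

```python
def quantize_int8(values):
    """Affine per-tensor quantization of a list of floats to the int8 range [-128, 127]."""
    lo = min(values + [0.0])   # the representable range must include 0.0
    hi = max(values + [0.0])
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0            # all-zero tensor: any scale works
    zero_point = round(-128 - lo / scale)   # the int8 code that represents 0.0
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-0.8, -0.1, 0.0, 0.3, 1.2]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, zero_point)
print("max round-trip error:", max_err)
```

Each value is stored in 1 byte instead of 4, and the round-trip error stays within about half a quantization step (`scale / 2`) for values inside the calibrated range.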
```python
import torch
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# `model` is assumed to be a trained torch.nn.Module defined elsewhere

# Step 1: Export PyTorch model to ONNX
model.eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["image"],
                  output_names=["logits"],
                  dynamic_axes={"image": {0: "batch"}})

# Step 2: Optimize with ONNX Runtime
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options)

# Step 3: Quantize weights to INT8
# (dynamic quantization: activations are quantized on the fly at runtime)
quantize_dynamic("model.onnx", "model_int8.onnx",
                 weight_type=QuantType.QInt8)
```
The journey from training to inference. Green stages are hardware-agnostic; orange stages are target-specific.
Architecture efficiency doesn't exist in a vacuum — it exists relative to hardware. An operation that's "efficient" in FLOPs might be slow if the hardware can't execute it well. Understanding hardware is how you design architectures that are fast in practice, not just in theory.
| Hardware | Strength | Peak TOPS / TFLOPS | Good At |
|---|---|---|---|
| CPU | Flexibility | ~1 (INT8) | Control flow, small batches |
| GPU (CUDA cores) | Parallelism | ~19.5 (A100, FP32) | Large matrix multiply, batch inference |
| Tensor Core | Matrix ops | ~312 (A100, FP16 dense) | 4x4 matmul blocks, transformer layers |
| TPU | Systolic arrays | ~275 (v4, BF16) | Dense matmul, training at scale |
| Mobile NPU | Efficiency | ~35 (A17 Pro) | Depthwise conv, quantized inference |
Performance is limited by either compute (how fast the hardware can multiply) or memory bandwidth (how fast data can be moved). The arithmetic intensity (FLOPs per byte loaded) determines which bottleneck dominates.
Dense matrix multiply has high arithmetic intensity (roughly O(N) FLOPs per byte for N×N matrices), so it's compute-bound on modern hardware. Depthwise convolution has low arithmetic intensity (only about K² multiply-adds per element loaded), so it's often memory-bound. This is why depthwise conv is "efficient" in FLOPs but not always fast in practice.
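The roofline idea fits in one line of code: attainable throughput is the smaller of the compute peak and (arithmetic intensity × memory bandwidth). The sketch below uses illustrative hardware numbers, roughly an A100's datasheet peak FP16 throughput and memory bandwidth, and two made-up intensities standing in for a low-intensity and a high-intensity operation.

```python
def attainable_flops(intensity, peak_flops, bandwidth_bytes_per_s):
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)

# Illustrative accelerator (approximate datasheet figures, assumed for this example)
PEAK = 312e12         # FLOP/s
BANDWIDTH = 1.555e12  # bytes/s

# Ridge point: the intensity at which the two limits meet
ridge = PEAK / BANDWIDTH

examples = [("low-intensity op", 2.0), ("high-intensity op", 500.0)]
for name, intensity in examples:
    perf = attainable_flops(intensity, PEAK, BANDWIDTH)
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{name}: {perf / 1e12:.1f} TFLOP/s ({bound})")
# low-intensity op: 3.1 TFLOP/s (memory-bound)
# high-intensity op: 312.0 TFLOP/s (compute-bound)
```

On this hypothetical device an operation needs about 200 FLOPs per byte before the compute units, rather than memory, become the limit.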
```python
# Arithmetic intensity examples (FP32: 4 bytes per element; 1 multiply-add = 2 FLOPs)

# Dense matmul: C = A @ B, A is MxK, B is KxN
#   FLOPs: 2*M*K*N
#   Bytes: (M*K + K*N + M*N) * 4
#   Intensity: 2*M*K*N / ((M*K + K*N + M*N) * 4) ≈ N/6 for large square matrices

# Depthwise 3x3 conv: C channels, HxW spatial
#   FLOPs: 2 * 9 * C * H * W
#   Bytes: (C*H*W input + 9*C weights + C*H*W output) * 4
#   Intensity: 18*C*H*W / ((2*C*H*W + 9*C) * 4) ≈ 18/8 ≈ 2.25
#   Very low! Memory-bound on most hardware.

def arithmetic_intensity(op_type, **kwargs):
    """FLOPs per byte moved, assuming each tensor is read or written exactly once."""
    if op_type == "matmul":
        M, K, N = kwargs['M'], kwargs['K'], kwargs['N']
        flops = 2 * M * K * N
        bytes_moved = (M*K + K*N + M*N) * 4
    elif op_type == "depthwise":
        C, H, W, K = kwargs['C'], kwargs['H'], kwargs['W'], kwargs['K']
        flops = 2 * K*K * C * H * W
        bytes_moved = (2*C*H*W + K*K*C) * 4
    else:
        raise ValueError(f"unknown op_type: {op_type}")
    return flops / bytes_moved
```
Each hardware platform has a different roofline. Operations under the sloped part of the roof (left of the ridge point) are memory-bound; operations under the flat part are compute-bound. Hover to see where common layers fall.
This is the capstone. Everything comes together: depthwise separable blocks, compound scaling, hardware awareness, and deployment constraints. You're the architect now. Your job: design a network that meets a specific latency budget on a specific device while maximizing accuracy.
The key insight of hardware-aware design: there's no single "best" architecture. The best architecture depends on:
| Device | Preferred Blocks | Precision | Max Channels |
|---|---|---|---|
| A100 GPU | Standard conv, attention | FP16/TF32 | 2048+ |
| iPhone NPU | Depthwise sep, inverted res | INT8/FP16 | 512 |
| Raspberry Pi | Thin depthwise, no attention | INT8 | 128 |
Select a device, task, and latency budget. Adjust architecture parameters and watch predicted performance metrics change. Design a model that fits your constraints!
Efficient convolutions were the story from 2017-2020. The frontier has moved. Vision Transformers, state-space models, and hardware-software co-optimizations are defining the next generation of efficient architectures.
The Vision Transformer (ViT) splits an image into patches of 16×16 pixels, projects each to an embedding, and applies transformer layers. The problem: self-attention is O(N²) in sequence length. For a 224×224 image with 16×16 patches, N = (224/16)² = 196. Manageable. But for higher resolution or dense prediction, N explodes.
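A few lines of arithmetic show how quickly "N explodes". This sketch assumes square images whose side is a multiple of the patch size and ignores the class token; the helper name is illustrative.

```python
def vit_tokens(image_size, patch_size=16):
    """Number of patch tokens for a square image (class token not counted)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    return (image_size // patch_size) ** 2

for size in (224, 384, 1024):
    n = vit_tokens(size)
    print(size, n, n * n)   # image side, tokens N, attention matrix entries N^2
# 224 196 38416
# 384 576 331776
# 1024 4096 16777216
```

Going from 224 to 1024 pixels per side multiplies the token count by about 21 and the attention matrix size by about 437, per head and per layer.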
Efficiency tricks:
Mamba (Gu & Dao, 2023) offers an entirely different approach to sequence modeling. Instead of attention (O(N²)) or even linear attention (O(N) but with reduced capacity), Mamba uses a selective state-space model that processes sequences in O(N) time with O(1) memory per step during inference.
The key innovation: B, C, and the step size Δ are input-dependent (selective), so the discretized state transition also changes with the input. This lets the model decide what to remember and what to forget, similar to a gated RNN, but it keeps the parallelizable structure of SSMs during training.
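To give a feel for the recurrence, here is a heavily simplified sketch of a selective state-space scan in inference mode. It is not the Mamba implementation: it handles a single scalar input channel, uses a diagonal state, and the projection weights `w_delta`, `w_b`, and `w_c` are hypothetical stand-ins for learned linear layers. The discretization (exponential for the state decay, a simple Δ·B term for the input) is an assumption of this sketch.

```python
import math

def softplus(z):
    return math.log1p(math.exp(z))

def selective_ssm(xs, a, w_delta, w_b, w_c):
    """Sequential scan over scalar inputs xs with a diagonal state of size len(a).

    a: fixed negative decay rates (learned, but not input-dependent).
    w_delta, w_b, w_c: hypothetical weights that make the step size, the input
    projection, and the output projection depend on the current input.
    """
    n = len(a)
    h = [0.0] * n                        # fixed-size state: O(1) memory per step
    ys = []
    for x in xs:                         # one pass over the sequence: O(N) time
        delta = softplus(w_delta * x)    # input-dependent step size, always > 0
        b = [w * x for w in w_b]         # input-dependent B
        c = [w * x for w in w_c]         # input-dependent C
        for i in range(n):
            a_bar = math.exp(delta * a[i])          # decay factor in (0, 1)
            h[i] = a_bar * h[i] + delta * b[i] * x  # forget a little, write a little
        ys.append(sum(c[i] * h[i] for i in range(n)))
    return ys, h

ys, h = selective_ssm([1.0, 0.5, -0.3, 2.0], a=[-1.0, -0.5],
                      w_delta=1.0, w_b=[0.5, -0.2], w_c=[1.0, 0.3])
print(ys, h)
```

A large Δ makes `a_bar` small, so the old state is mostly overwritten; a small Δ keeps the old state and nearly ignores the current input. That input-dependent trade-off is the "selective" part.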
FlashAttention doesn't change the math of attention — it changes how it's computed to exploit the GPU memory hierarchy. Standard attention materializes the N×N attention matrix in GPU HBM (slow global memory). FlashAttention tiles the computation so that it stays in SRAM (fast on-chip memory).
The speedup: 2-4× faster, 5-20× less memory. This isn't approximation — it's exact attention, just computed more cleverly relative to hardware.
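The trick that makes tiling exact is the online softmax: keep a running maximum, a running normalizer, and an unnormalized output, and rescale them whenever a new block raises the maximum. The sketch below shows this for a single query vector in plain Python. It illustrates the math only, not the GPU kernel, and the function names are illustrative.

```python
import math

def attention_row_standard(q, keys, values):
    """Reference: compute all scores for one query, then softmax and weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def attention_row_tiled(q, keys, values, block_size=2):
    """Same result, but keys/values are visited block by block with running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])   # unnormalized output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # rescale earlier partial results
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.1, 0.2, 0.3, 0.4]
keys = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.1, 0.0, 0.3], [-1.0, 2.0, 0.5, 0.0],
        [0.0, 0.0, 1.0, 1.0], [3.0, -1.0, 0.2, 0.1]]
values = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [2.0, 2.0], [0.5, -0.5]]
print(attention_row_standard(q, keys, values))
print(attention_row_tiled(q, keys, values))
```

Both calls print the same numbers up to floating-point rounding, while the tiled version only ever holds one block of scores at a time.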
```python
# FlashAttention: exact same math, hardware-aware implementation

# Standard (slow, O(N²) memory):
#   attn = softmax(Q @ K.T / sqrt(d)) @ V

# FlashAttention (fast, O(N) memory):
#   Tiles Q, K, V into blocks that fit in SRAM
#   Computes attention block-by-block using online softmax
#   Never materializes the full N×N matrix

from flash_attn import flash_attn_func

# q, k, v are assumed to be FP16/BF16 CUDA tensors of shape
# (batch, seqlen, num_heads, head_dim), defined elsewhere
# Same result as standard attention (up to rounding), typically 2-4x faster
output = flash_attn_func(q, k, v, causal=True)

# Memory comparison for sequence length 4096:
#   Standard:       4096 × 4096 × 2 bytes = 32 MB per head
#   FlashAttention: O(block_size) ≈ 256 KB per head
```
Each dot is a model. X-axis: FLOPs. Y-axis: accuracy. The frontier moves up-and-left over time as architectures get more efficient.
You now understand the full stack of efficient architecture design: from the fundamental compute savings of depthwise separable convolutions, through compound scaling and automated search, to hardware-aware deployment and emerging paradigms.
| Architecture | Params | FLOPs | Top-1 | Mobile Latency |
|---|---|---|---|---|
| MobileNetV2 1.0 | 3.4M | 300M | 72.0% | ~6ms |
| MobileNetV3-Small | 2.5M | 56M | 67.4% | ~3ms |
| EfficientNet-B0 | 5.3M | 390M | 77.1% | ~12ms |
| EfficientNet-B3 | 12M | 1.8B | 81.6% | ~45ms |
| Swin-Tiny | 28M | 4.5B | 81.3% | ~90ms |
| ConvNeXt-Tiny | 28M | 4.5B | 82.1% | ~85ms |
| ResNet-50 | 25M | 4.1B | 76.1% | ~70ms |
Let's formally prove the savings ratio. For a layer with kernel K, Cin input channels, Cout output channels, and H×W spatial:
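The derivation can be written out as follows, counting multiply-adds and assuming stride 1 with "same" padding so the output is also H×W. The symbols match the ones used in the text.

```latex
\begin{align*}
\text{Standard:}\quad  & K^2 \, C_{in} \, C_{out} \, H \, W \\
\text{Depthwise:}\quad & K^2 \, C_{in} \, H \, W \\
\text{Pointwise:}\quad & C_{in} \, C_{out} \, H \, W \\[4pt]
\text{Ratio} &= \frac{K^2 \, C_{in} \, C_{out} \, H \, W}
                     {K^2 \, C_{in} \, H \, W + C_{in} \, C_{out} \, H \, W}
              = \frac{K^2 \, C_{out}}{K^2 + C_{out}}
              = \frac{K^2}{1 + K^2 / C_{out}}
\end{align*}
```

The factors Cin, H, and W cancel, so the ratio depends only on the kernel size and the number of output channels.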
As Cout → ∞, ratio → K². For K=3: maximum savings is 9×. For typical Cout=256: savings = 9×256/(9+256) = 8.7×. For Cout=64: savings = 9×64/(9+64) = 7.9×. The savings improve with more output channels.