Efficient Architectures — From Brute Force to Brilliant Design

Chapter 0: Why Architecture Matters

Two image classifiers. Both achieve 76% accuracy on ImageNet. Model A uses 10 billion FLOPs per image. Model B uses 500 million FLOPs. Same answer, 20× less compute. Model B runs on your phone in real time. Model A needs a data center GPU.

The difference? Architecture design. Not training tricks, not better data, not bigger GPUs. Pure structural decisions about how information flows through the network.

Here's the fundamental equation of computational cost:

Total FLOPs = ∑_layers (operations per output value) × (spatial positions) × (output channels)

A badly designed layer can waste 90% of its FLOPs on redundant computation — multiplying numbers that contribute almost nothing to the final answer. A well-designed layer computes only what matters.

The core insight: Compute is not free. Every multiply-add costs energy, time, and money. Architecture efficiency means getting the same representational power with fewer operations — not by being approximate, but by being structurally clever.

Let's make this concrete. A standard 3×3 convolution with 256 input channels and 256 output channels on a 56×56 feature map costs:

FLOPs = 3 × 3 × 256 × 256 × 56 × 56 = 1.85 billion

That's almost 2 billion multiply-adds for a single layer. A typical network has 50+ such layers. The question isn't "can we afford this?" — it's "do we need all of it?"
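
The arithmetic above can be reproduced with a small helper. This is a minimal sketch: the function name `conv_flops` and its argument names are illustrative (not from any library), it assumes stride 1 with "same" padding, and it counts one multiply-add as one FLOP, as this lesson does.

```python
def conv_flops(kernel, c_in, c_out, height, width):
    """Multiply-adds for one standard conv layer (stride 1, 'same' padding assumed)."""
    per_position = kernel * kernel * c_in * c_out
    return per_position * height * width

layer = conv_flops(kernel=3, c_in=256, c_out=256, height=56, width=56)
print(f"{layer:,}")  # 1,849,688,064

# Doubling both channel counts quadruples the cost (the C_in * C_out term).
ratio = conv_flops(3, 512, 512, 56, 56) / layer
print(ratio)  # 4.0
```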

FLOP Calculator: Standard vs Efficient

Adjust kernel size, channels, and spatial resolution to see how FLOPs explode for standard convolutions. The teal bar shows depthwise separable cost for comparison.

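Notice how the standard convolution bar grows quadratically with channels (K² × C_in × C_out per position, so doubling both channel counts quadruples the cost). The separable alternative pays the K² factor only on a term that is linear in channels (K² × C_in + C_in × C_out). This gap is the opportunity that efficient architectures exploit.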

Why this matters now: Edge devices (phones, robots, drones, AR glasses) have strict power and latency budgets. You can't just "throw more GPU at it." Architecture efficiency is the only path to real-time inference on constrained hardware.

A standard 3×3 conv layer with C input and C output channels costs K²×C² FLOPs per spatial position. If you double the channel count from 128 to 256, how much do FLOPs increase?

2× (doubles)
3× (triples)
4× (quadruples, because C²)
8× (cubes)

Chapter 1: Standard vs Depthwise Separable

The single most important efficiency trick in deep learning is the depthwise separable convolution. It splits one expensive operation into two cheap ones — and loses almost nothing in accuracy.

Let's understand what a standard convolution actually computes. At each spatial position, a K×K×C_in patch is dot-producted with a filter of the same size, producing one output value. You need C_out such filters to produce all output channels.

Standard convolution cost per spatial position: K² × C_in × C_out multiply-adds. It mixes spatial information AND channel information simultaneously.

The key insight: spatial mixing and channel mixing don't have to happen at the same time. We can separate them.

Step 1: Depthwise Convolution (spatial only)

Apply a separate K×K filter to EACH input channel independently. No cross-channel interaction. Each channel gets its own spatial filter.

Cost per position = K² × C_in

Step 2: Pointwise Convolution (channel only)

Apply 1×1 convolutions to mix channels. This is just a matrix multiply at each spatial position — a linear projection from C_in dimensions to C_out dimensions.

Cost per position = C_in × C_out

Total depthwise separable cost:

K² × C_in + C_in × C_out

Savings ratio:

Standard / Separable = (K² × C_in × C_out) / (K² × C_in + C_in × C_out)

= 1 / (1/C_out + 1/K²) = K² × C_out / (K² + C_out)

For K=3, C_out=256: savings = 9×256 / (9+256) = 2304/265 ≈ 8.7×

Worked example: 3×3 conv, 256 → 256 channels, 56×56 spatial.
Standard: 3×3×256×256 × 56×56 = 1,849,688,064 FLOPs
Depthwise: 3×3×256 × 56×56 = 7,225,344 FLOPs
Pointwise: 256×256 × 56×56 = 205,520,896 FLOPs
Total separable: 212,746,240 FLOPs — that's 8.7× cheaper!
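
To check the worked example against the closed-form ratio, here is a short sketch. The helper names `separable_vs_standard` and `closed_form_savings` are made up for this illustration, and multiply-adds are again counted as FLOPs.

```python
def separable_vs_standard(kernel, c_in, c_out, height, width):
    """Return multiply-add counts for a standard conv and its separable split."""
    positions = height * width
    standard = kernel**2 * c_in * c_out * positions
    depthwise = kernel**2 * c_in * positions   # spatial filtering, per channel
    pointwise = c_in * c_out * positions       # 1x1 channel mixing
    separable = depthwise + pointwise
    return {
        "standard": standard,
        "depthwise": depthwise,
        "pointwise": pointwise,
        "separable": separable,
        "savings": standard / separable,
    }

def closed_form_savings(kernel, c_out):
    """K^2 * C_out / (K^2 + C_out); note that C_in cancels out."""
    return kernel**2 * c_out / (kernel**2 + c_out)

result = separable_vs_standard(3, 256, 256, 56, 56)
print(f"{result['separable']:,}")               # 212,746,240
print(round(result["savings"], 2))              # 8.69
print(round(closed_form_savings(3, 256), 2))    # 8.69
```

The spatial size and C_in drop out of the ratio, which is why the closed form needs only K and C_out.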

python
import torch.nn as nn

# Standard convolution: mixes spatial + channels together
standard = nn.Conv2d(256, 256, kernel_size=3, padding=1)
# Weights (ignoring biases): 3*3*256*256 = 589,824

# Depthwise separable: two steps
depthwise = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=256)
pointwise = nn.Conv2d(256, 256, kernel_size=1)
# Weights (ignoring biases): 3*3*256 + 256*256 = 2,304 + 65,536 = 67,840
# That's 8.7x fewer weights too!

def depthwise_separable(x):
    x = depthwise(x)   # spatial filtering per channel
    x = pointwise(x)   # channel mixing
    return x

FLOP Breakdown: Where Does Computation Go?

The standard conv is one monolithic block. The separable version splits into two small blocks. Adjust channels to see how the ratio changes.

In a depthwise separable convolution, which step handles cross-channel interaction?

The depthwise step (K×K per channel)
The pointwise step (1×1 across channels)
Both steps equally

Chapter 2: MobileNet & Inverted Residuals

In 2017, Google published MobileNetV1: stack depthwise separable convolutions, add batch normalization and ReLU after each step, and you get a network that needs roughly 27× less compute than VGG-16 with nearly the same ImageNet accuracy. Simple and beautiful.

But V1 had a problem. The depthwise step operates on each channel independently — it can't learn cross-channel features at that stage. If you have a narrow bottleneck (few channels), the depthwise filter is starved of information. You need enough channels for the spatial filter to work with.

MobileNetV2: The Inverted Residual Block

The solution is counterintuitive. A standard residual block (ResNet) goes wide→narrow→wide: compress channels with 1×1, do the expensive 3×3 in the narrow space, then expand back. MobileNetV2 does the opposite.

Inverted residual: narrow→wide→narrow. Start with few channels (the bottleneck), EXPAND with 1×1 to give the depthwise filter more channels to work with, apply the 3×3 depthwise in the WIDE space, then PROJECT back down to the narrow bottleneck with 1×1.

Why "inverted"? Because the residual connection links the narrow bottlenecks rather than the wide inner representation, the reverse of a ResNet bottleneck block. The data lives in the skinny state; the fat state is temporary, just for spatial filtering.

Input

24 channels (narrow bottleneck)

↓

Expand 1×1

24 → 144 channels (t=6)

↓

Depthwise 3×3

144 channels, spatial filtering

↓

Project 1×1

144 → 24 channels (compress)

↓

Output + Residual

24 channels (add to input)

The expansion factor t (typically 6) controls how wide the internal representation gets. With t=6 and a 24-channel bottleneck, the depthwise step works on 144 channels — plenty of capacity for rich spatial features.

Worked Example: FLOPs for One Inverted Residual Block

Input: 24 channels, 56×56 spatial, expansion t=6.

Expand: 24 × 144 × 56² = 10,838,016 FLOPs

Depthwise: 9 × 144 × 56² = 4,064,256 FLOPs

Project: 144 × 24 × 56² = 10,838,016 FLOPs

Total: 25,740,288 FLOPs

Compare to a standard 3×3 conv with 144 input and 144 output channels at 56×56:

Standard: 9 × 144 × 144 × 56² = 585,252,864 FLOPs

The inverted residual is about 23× cheaper than a standard convolution at the same internal width, and in practice it gives up little accuracy.
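
The block-level count can be scripted in the same way. The sketch below assumes stride 1 and ignores batch norm and activations; `inverted_residual_flops` is an illustrative name, not a library function.

```python
def inverted_residual_flops(c_in, c_out, expand_ratio, height, width, kernel=3):
    """Multiply-adds for expand (1x1) + depthwise (KxK) + project (1x1), stride 1."""
    c_mid = c_in * expand_ratio
    positions = height * width
    expand = c_in * c_mid * positions if expand_ratio != 1 else 0
    depthwise = kernel**2 * c_mid * positions
    project = c_mid * c_out * positions
    return expand + depthwise + project

block = inverted_residual_flops(c_in=24, c_out=24, expand_ratio=6, height=56, width=56)
standard_wide = 9 * 144 * 144 * 56 * 56   # standard 3x3 conv at the expanded width

print(f"{block:,}")                       # 25,740,288
print(round(standard_wide / block, 1))    # 22.7
```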

python
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, c_in, c_out, stride, expand_ratio):
        super().__init__()
        c_mid = c_in * expand_ratio
        # Skip connection only when input and output shapes match
        self.use_residual = (stride == 1 and c_in == c_out)

        layers = []
        if expand_ratio != 1:
            # Expand: 1x1 conv to widen channels
            # (bias is redundant when BatchNorm follows)
            layers += [nn.Conv2d(c_in, c_mid, 1, bias=False),
                       nn.BatchNorm2d(c_mid), nn.ReLU6()]
        # Depthwise: 3x3 spatial filtering, one filter per channel
        layers += [nn.Conv2d(c_mid, c_mid, 3, stride,
                             padding=1, groups=c_mid, bias=False),
                   nn.BatchNorm2d(c_mid), nn.ReLU6()]
        # Project: 1x1 conv to narrow channels (NO activation!)
        layers += [nn.Conv2d(c_mid, c_out, 1, bias=False),
                   nn.BatchNorm2d(c_out)]
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_residual:
            return x + self.conv(x)
        return self.conv(x)

Critical detail: The projection step has NO activation function (no ReLU). Why? Because ReLU destroys information in low-dimensional spaces — it zeros out negative values. In the narrow bottleneck, every dimension is precious, so we keep it linear. This was a key insight of the MobileNetV2 paper.

Standard vs Inverted Residual Data Flow

Left: standard residual (wide→narrow→wide). Right: inverted residual (narrow→wide→narrow). In the standard block the skip connection joins the wide ends; in the inverted block it joins the narrow bottlenecks.

Why does MobileNetV2's inverted residual expand channels BEFORE the depthwise convolution?

To give the depthwise filter more channels to work with for richer spatial features
To reduce the number of parameters
To make the network deeper

Chapter 3: EfficientNet & Compound Scaling

You have a baseline network that works. Now you want to make it bigger for higher accuracy. You have three knobs to turn: make it deeper (more layers), wider (more channels per layer), or increase resolution (bigger input images). Which do you turn?

Most people pick one. ResNet scales depth (18→34→50→101→152). WideResNet scales width. But the EfficientNet paper (Tan & Le, 2019) discovered something profound: scaling all three together, in a specific ratio, is dramatically better than scaling any one dimension alone.

Compound scaling rule: Given a compute budget multiplier φ, scale all three dimensions simultaneously:
depth: d = α^φ
width: w = β^φ
resolution: r = γ^φ
Subject to: α · β² · γ² ≈ 2

Why those exponents? FLOPs scale as d × w² × r² (depth is linear, width and resolution are quadratic in their effect on compute). The constraint α·β²·γ² ≈ 2 ensures that each unit increase in φ roughly doubles the total FLOPs.

EfficientNet's specific values:

α = 1.2, β = 1.1, γ = 1.15

Check: 1.2 × 1.1² × 1.15² = 1.2 × 1.21 × 1.3225 = 1.92 ≈ 2 ✓
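
The same check can be run for any φ. This sketch only evaluates the d × w² × r² rule stated above; the function name `flop_multiplier` is illustrative.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution coefficients

def flop_multiplier(phi):
    """FLOP growth implied by the rule: depth * width^2 * resolution^2."""
    d = ALPHA ** phi
    w = BETA ** phi
    r = GAMMA ** phi
    return d * w**2 * r**2

print(round(flop_multiplier(1), 2))   # 1.92
print(flop_multiplier(3))             # roughly 7.1, a little under the nominal 2**3 = 8
```

Because α · β² · γ² is 1.92 rather than exactly 2, the "doubling per step" is an approximation that drifts further from 2^φ as φ grows.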

Worked Example: Scaling B0 to B3

EfficientNet-B0 baseline: 18 layers, 32 base width, 224×224 input.

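Treating B3 as φ = 3 (a simplification; the released B3 uses a depth multiplier of 1.4, a width multiplier of 1.2, and a 300×300 input):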

depth: d = 1.2³ = 1.728 → 18 × 1.728 = 31.1 → 32 layers (rounded up)

width: w = 1.1³ = 1.331 → 32 × 1.331 = 43 base channels

resolution: r = 1.15³ = 1.521 → 224 × 1.521 = 341×341 input

FLOP increase: nominally 2³ = 8× over B0 (with these coefficients, 1.92³ ≈ 7.1×)

The result: EfficientNet-B3 achieves 81.6% ImageNet accuracy with only 1.8B FLOPs — compared to ResNet-152 at 78.3% accuracy with 11.6B FLOPs. Higher accuracy, 6.4× less compute.

Model	Top-1 Acc	FLOPs	φ
EfficientNet-B0	77.1%	0.39B	0
EfficientNet-B1	79.1%	0.70B	1
EfficientNet-B2	80.1%	1.0B	2
EfficientNet-B3	81.6%	1.8B	3
EfficientNet-B4	82.9%	4.2B	4
EfficientNet-B7	84.3%	37B	7
ResNet-152	78.3%	11.6B	—

3D Compound Scaling Visualizer

Drag the φ slider to scale the network. Watch depth, width, and resolution grow together. The cube's volume represents total FLOPs.

python
import math

# EfficientNet compound scaling
alpha = 1.2   # depth coefficient
beta  = 1.1   # width coefficient
gamma = 1.15  # resolution coefficient

def scale_model(phi, base_depth=18, base_width=32, base_res=224):
    d = math.ceil(base_depth * alpha**phi)
    w = math.ceil(base_width * beta**phi)
    r = math.ceil(base_res * gamma**phi)
    flop_mult = 2**phi  # approximately
    return {'depth': d, 'width': w, 'resolution': r,
            'flop_mult': flop_mult}

# EfficientNet-B3
b3 = scale_model(phi=3)
print(b3)  # {'depth': 32, 'width': 43, 'resolution': 341, 'flop_mult': 8}

Why does the compound scaling constraint use β² and γ² but only α¹?

Because depth is more important than width
Because FLOPs scale linearly with depth but quadratically with width and resolution
Because it's a convention with no mathematical basis

Chapter 4: Neural Architecture Search

What if instead of a human designing the architecture, we let an algorithm search for one? That's Neural Architecture Search (NAS) — automated discovery of optimal network structures.

The setup: define a search space (what operations are allowed? how many layers? which connections?), a search strategy (how do we explore?), and an evaluation method (how do we score each candidate?).

NAS as optimization: Think of it as a meta-learning problem. The "model" being trained is the architecture itself. The "objective" is validation accuracy, often traded off against latency or FLOPs. The "optimizer" is whatever search algorithm we use: RL, evolutionary, or gradient-based.

Search Space

A typical cell-based search space (NASNet) defines:

Operations: 3×3 conv, 5×5 conv, 3×3 depthwise, max pool, avg pool, skip, zero
Connections: which pairs of nodes have edges
Cell structure: how to wire 5-7 nodes into a computational cell
Macro structure: how many cells, where to downsample
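
To get a feel for how large such a space is, here is a deliberately simplified count. It assumes every pair of nodes in a cell is connected by one edge and that each edge independently picks one of the seven operations; this is an illustrative encoding, not NASNet's actual one, and the names are hypothetical.

```python
OPERATIONS = ["conv3x3", "conv5x5", "dwconv3x3", "max_pool", "avg_pool", "skip", "zero"]

def num_edges(num_nodes):
    """Edges in a fully connected DAG: every earlier node feeds every later one."""
    return num_nodes * (num_nodes - 1) // 2

def num_candidates(num_nodes, num_ops=len(OPERATIONS)):
    """Distinct cells when each edge picks one operation independently."""
    return num_ops ** num_edges(num_nodes)

print(num_edges(5))                 # 10
print(f"{num_candidates(5):,}")     # 282,475,249
print(num_candidates(7) > 10**17)   # True
```

Even this toy space is far too large to enumerate, which is why the search strategy and the cost of each evaluation matter so much.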

The Cost Problem

Naive NAS (Zoph & Le, 2017): train each candidate architecture to convergence, and evaluate thousands of them. Even at an optimistic 1 GPU-hour per evaluation, that's thousands of GPU-hours; the original experiments reportedly ran on hundreds of GPUs for weeks. That is impractical for most teams.

One-Shot NAS (Weight Sharing)

The breakthrough: train a single supernet that contains ALL possible architectures as sub-networks. Each candidate is a subset of the supernet's weights. To evaluate a candidate, just mask the supernet to that subset — no retraining needed.

Define Search Space

Operations, connections, cell topology

↓

Train Supernet

All paths get weight updates (stochastic sampling)

↓

Evaluate subnets using shared weights, no retraining

↓

Retrain Winner

Train the found architecture from scratch for final accuracy

python
import torch.nn as nn

# Simplified one-shot NAS supernet
class SupernetCell(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),     # op 0: 3x3 conv
            nn.Conv2d(channels, channels, 5, padding=2),     # op 1: 5x5 conv
            nn.Conv2d(channels, channels, 3, padding=1,      # op 2: depthwise
                      groups=channels),
            nn.MaxPool2d(3, stride=1, padding=1),          # op 3: max pool
            nn.Identity(),                                    # op 4: skip
        ])

    def forward(self, x, arch_choice):
        # arch_choice selects which operation to use
        return self.ops[arch_choice](x)

# During search: sample arch_choice randomly per step
# After search: fix arch_choice to best found architecture

Modern NAS efficiency: Hardware-aware NAS (like MnasNet, FBNet) folds latency into the search objective. MnasNet uses reward = accuracy × (latency/target)^w with a small negative exponent w (about −0.07), so exceeding the latency target lowers the reward. This steers the search toward architectures that are both accurate AND fast on the target device.
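
A reward of this shape is easy to sketch. The function name `latency_aware_reward` is illustrative, and the default exponent of −0.07 is the value recalled from the MnasNet paper, so treat it as an assumption; the exponent must be negative for slower models to score lower.

```python
def latency_aware_reward(accuracy, latency_ms, target_ms, w=-0.07):
    """accuracy * (latency / target) ** w, with w < 0 penalizing slow models."""
    return accuracy * (latency_ms / target_ms) ** w

on_target = latency_aware_reward(0.75, latency_ms=80, target_ms=80)
too_slow = latency_aware_reward(0.75, latency_ms=160, target_ms=80)
faster = latency_aware_reward(0.75, latency_ms=40, target_ms=80)

print(on_target)                       # 0.75
print(too_slow < on_target < faster)   # True
```

The small exponent makes the penalty soft: doubling latency costs only a few percent of reward, so the search can still trade a little speed for a real accuracy gain.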

NAS Search Animation

Watch architectures being proposed, evaluated, and selected. Green = high accuracy, red = low. The population evolves toward better designs.

What is the main advantage of one-shot NAS over naive NAS?

It finds better architectures
It avoids retraining each candidate from scratch by sharing weights in a supernet
It uses a smaller search space

Chapter 5: Mobile Deployment

You've designed an efficient architecture. You've trained it in PyTorch or TensorFlow on a beefy GPU server. Now you need to run it on a phone, a drone, or an embedded sensor. The gap between "works in my notebook" and "runs at 30fps on an iPhone" is enormous.

The problem: different hardware requires different representations. Your GPU training framework stores the model as a Python object graph with floating-point weights. A mobile chip needs a compact binary with quantized weights and fused operations.

The deployment pipeline: Train (PyTorch/TF) → Export (ONNX) → Optimize (graph fusion, quantization) → Compile (target-specific) → Deploy (device runtime).

Export Formats

Format	Target	Key Feature
ONNX	Interchange	Framework-agnostic graph format
TFLite	Android/embedded	Quantization-aware, small binary
CoreML	Apple devices	Neural Engine acceleration
TensorRT	NVIDIA GPUs	Kernel fusion, FP16/INT8
NNAPI	Android NPUs	Hardware abstraction layer

Graph Optimizations

Before deploying, the compiler performs transformations that don't change the math but dramatically speed up execution:

Operator fusion: Conv + BatchNorm + ReLU becomes one kernel call instead of three
Constant folding: Pre-compute anything that doesn't depend on input
Dead code elimination: Remove unused branches
Layout optimization: Rearrange memory for target hardware (NCHW vs NHWC)
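
Operator fusion is worth seeing in numbers. At inference time BatchNorm is just a per-channel scale and shift, so it can be folded into the preceding convolution's weights and bias. The sketch below does this for a single output channel at a single spatial position in plain Python; the helper names and the toy values are illustrative assumptions.

```python
import math

def fold_batchnorm(weight, bias, bn_gamma, bn_beta, bn_mean, bn_var, eps=1e-5):
    """Fold inference-mode BatchNorm into the conv weights of one output channel."""
    scale = bn_gamma / math.sqrt(bn_var + eps)
    fused_weight = [w * scale for w in weight]
    fused_bias = (bias - bn_mean) * scale + bn_beta
    return fused_weight, fused_bias

def conv_at_position(patch, weight, bias):
    """Dot product of one input patch with one filter, plus bias."""
    return sum(p * w for p, w in zip(patch, weight)) + bias

patch = [0.5, 1.0, 2.0]                  # toy flattened input patch
weight, bias = [0.2, 0.4, -0.1], 0.3     # toy filter for one output channel
gamma, beta, mean, var = 1.5, -0.2, 0.1, 0.8

# Unfused: conv, then BatchNorm, then ReLU (three separate steps)
y = conv_at_position(patch, weight, bias)
y = gamma * (y - mean) / math.sqrt(var + 1e-5) + beta
unfused = max(y, 0.0)

# Fused: one conv with folded weights, ReLU applied in the same pass
fused_weight, fused_bias = fold_batchnorm(weight, bias, gamma, beta, mean, var)
fused = max(conv_at_position(patch, fused_weight, fused_bias), 0.0)

print(abs(fused - unfused) < 1e-9)   # True
```

The output is unchanged; what disappears is two extra passes over the activation tensor in memory.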

Quantization

The biggest single deployment optimization. Convert FP32 weights and activations to INT8:

x_int8 = round(x_fp32 / scale) + zero_point

This gives 4× memory reduction, 2-4× speedup on INT8-capable hardware, and typically less than 1% accuracy loss with proper calibration.
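
Here is a minimal sketch of that affine mapping and its inverse for a single value. It assumes a signed 8-bit range, a calibration range with x_max > x_min, and per-tensor parameters; the function names are illustrative and not taken from any quantization library.

```python
def quantize_params(x_min, x_max, q_min=-128, q_max=127):
    """Pick scale and zero_point so that [x_min, x_max] maps onto [q_min, q_max]."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)   # keep 0.0 exactly representable
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = round(q_min - x_min / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, q_min=-128, q_max=127):
    """x_int8 = round(x_fp32 / scale) + zero_point, clamped to the int8 range."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))

def dequantize(q, scale, zero_point):
    """Approximate inverse of quantize."""
    return (q - zero_point) * scale

scale, zero_point = quantize_params(-1.0, 3.0)
print(zero_point)                                   # -64
print(quantize(3.0, scale, zero_point))             # 127
print(dequantize(quantize(0.0, scale, zero_point), scale, zero_point))   # 0.0
```

For values inside the calibrated range, the round-trip error is at most half a quantization step (scale / 2), which is where the small accuracy loss comes from.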

python
import torch
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# `model` is assumed to be a trained torch.nn.Module defined elsewhere.

# Step 1: Export PyTorch model to ONNX
model.eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["image"],
                  output_names=["logits"],
                  dynamic_axes={"image": {0: "batch"}})

# Step 2: Optimize with ONNX Runtime
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options)

# Step 3: Quantize weights to INT8
# (dynamic quantization computes activation ranges at run time; for conv-heavy
# models, static quantization with a calibration set is often the better fit)
quantize_dynamic("model.onnx", "model_int8.onnx",
                 weight_type=QuantType.QInt8)

Operator support gaps: Not every PyTorch operation has a TFLite equivalent. Custom activations (Swish, Mish), dynamic shapes, and complex control flow often need workarounds. Always check the operator compatibility table before designing your architecture for mobile.

Deployment Pipeline Flowchart

The journey from training to inference. Green stages are hardware-agnostic; orange stages are target-specific.

What does "operator fusion" do during model compilation?

Adds more operators for accuracy
Removes unused layers
Combines multiple sequential operations (like Conv+BN+ReLU) into a single kernel call

Chapter 6: Hardware Acceleration

Architecture efficiency doesn't exist in a vacuum — it exists relative to hardware. An operation that's "efficient" in FLOPs might be slow if the hardware can't execute it well. Understanding hardware is how you design architectures that are fast in practice, not just in theory.

The hardware truth: FLOPs ≠ latency. A 4×4 matrix multiply is essentially free on a Tensor Core (done in one clock cycle). A weird element-wise operation with data-dependent branching takes hundreds of cycles. The hardware dictates which operations are fast.

Hardware Taxonomy

Hardware	Strength	Peak TOPS/TFLOPS	Good At
CPU	Flexibility	~1 (INT8)	Control flow, small batches
GPU (CUDA cores)	Parallelism	~19.5 (A100, FP32)	Large matrix multiply, batch inference
Tensor Core	Matrix ops	~312 (A100, FP16 dense)	4x4 matmul blocks, transformer layers
TPU	Systolic arrays	~275 (v4, BF16)	Dense matmul, training at scale
Mobile NPU	Efficiency	~35 (A17 Pro)	Depthwise conv, quantized inference

The Roofline Model

Performance is limited by either compute (how fast the hardware can multiply) or memory bandwidth (how fast data can be moved). The arithmetic intensity (FLOPs per byte loaded) determines which bottleneck dominates.

Arithmetic Intensity = FLOPs / Bytes Accessed

Attainable Performance = min(Peak FLOPs, Bandwidth × Arithmetic Intensity)

Dense matrix multiply has high arithmetic intensity (~O(N) FLOPs per byte) — it's compute-bound on modern hardware. Depthwise convolution has low arithmetic intensity (only K² FLOPs per element loaded) — it's often memory-bound. This is why depthwise conv is "efficient" in FLOPs but not always fast in practice.

Design implication: On GPU/TPU, prefer large matrix multiplies (high arithmetic intensity). On mobile NPUs, depthwise separable is fine because the hardware has enough bandwidth relative to compute. Architecture efficiency is hardware-relative.
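
The roofline formula itself fits in one line of code. In the sketch below, the peak throughput (312 TFLOP/s) and memory bandwidth (1.555 TB/s) are assumed, approximate A100 figures used only for illustration; `attainable_flops` is a hypothetical helper name.

```python
def attainable_flops(intensity, peak_flops, bandwidth_bytes):
    """Roofline: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_flops, bandwidth_bytes * intensity)

PEAK = 312e12    # assumed peak, FLOP/s (A100 FP16 Tensor Core, approximate)
BW = 1.555e12    # assumed memory bandwidth, bytes/s (A100 40GB, approximate)

ridge = PEAK / BW   # intensity at which the two limits meet
print(round(ridge))   # 201 FLOPs per byte

depthwise = attainable_flops(1.125, PEAK, BW)    # depthwise 3x3, about 9/8 FLOPs/byte
matmul = attainable_flops(1000.0, PEAK, BW)      # a large dense matmul

print(depthwise / PEAK < 0.01)   # True: well under 1% of peak, memory-bound
print(matmul == PEAK)            # True: compute-bound
```

Anything with an intensity below the ridge point is limited by bandwidth no matter how many compute units the chip has.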

python
# Arithmetic intensity examples (4 bytes per element, i.e. FP32)

# Dense matmul: C = A @ B, A is MxK, B is KxN
# FLOPs: 2*M*K*N
# Bytes: (M*K + K*N + M*N) * bytes_per_elem
# Intensity: 2*M*K*N / ((M*K + K*N + M*N) * 4) ≈ N/6 for large square matrices

# Depthwise 3x3 conv: C channels, HxW spatial
# FLOPs: 9 * C * H * W  (counting multiply-adds, as elsewhere in this lesson)
# Bytes: (C*H*W input + 9*C weights + C*H*W output) * 4
# Intensity: 9*C*H*W / ((2*C*H*W + 9*C) * 4) ≈ 9/8 ≈ 1.1
# Very low! Memory-bound on most hardware.

def arithmetic_intensity(op_type, **kwargs):
    """Return FLOPs per byte moved for a dense matmul or a depthwise conv."""
    if op_type == "matmul":
        M, K, N = kwargs['M'], kwargs['K'], kwargs['N']
        flops = 2 * M * K * N
        bytes_moved = (M*K + K*N + M*N) * 4
    elif op_type == "depthwise":
        C, H, W, K = kwargs['C'], kwargs['H'], kwargs['W'], kwargs['K']
        flops = K*K * C * H * W
        bytes_moved = (2*C*H*W + K*K*C) * 4
    else:
        raise ValueError(f"unknown op_type: {op_type!r}")
    return flops / bytes_moved

Hardware Roofline Comparison

Each hardware platform has a different roofline. Operations left of the ridge point, under the sloped segment, are memory-bound; operations right of it, under the flat roof, are compute-bound. Hover to see where common layers fall.

A depthwise 3×3 convolution has low arithmetic intensity (~1 FLOP/byte). On a high-compute GPU, this means the operation is limited by:

Memory bandwidth (can't feed data fast enough to the compute units)
Compute throughput (too many FLOPs)
Network latency

Chapter 7: Architecture × Hardware Co-Design (Showcase)

This is the capstone. Everything comes together: depthwise separable blocks, compound scaling, hardware awareness, and deployment constraints. You're the architect now. Your job: design a network that meets a specific latency budget on a specific device while maximizing accuracy.

The design challenge: Select a target device, a task, and a latency budget. The system will recommend an architecture configuration. You can override any choice and watch the predicted latency, memory, and accuracy change in real time.

The key insight of hardware-aware design: there's no single "best" architecture. The best architecture depends on:

Target hardware: GPU favors large matmuls; NPU favors depthwise ops; CPU is flexible but slow
Latency budget: 1ms = tiny model; 10ms = MobileNet-class; 100ms = EfficientNet-class
Task complexity: Classification needs fewer features than detection or segmentation
Memory limit: Mobile has 1-4GB; must fit model + activations + framework overhead

Design Heuristics by Hardware

Device	Preferred Blocks	Precision	Max Channels
A100 GPU	Standard conv, attention	FP16/TF32	2048+
iPhone NPU	Depthwise sep, inverted res	INT8/FP16	512
Raspberry Pi	Thin depthwise, no attention	INT8	128

Architecture Co-Design Workbench

Select a device, task, and latency budget. Adjust architecture parameters and watch predicted performance metrics change. Design a model that fits your constraints!

Real-world NAS results: MnasNet used this approach: latency-constrained NAS targeting a Pixel phone. It found architectures that are 1.8× faster than MobileNetV2 with about 0.5% higher top-1 accuracy. The architecture is slightly "weird" (mix of 3×3 and 5×5, varying expansion ratios per stage) but well matched to that specific hardware.

Chapter 8: Emerging Architectures

Efficient convolutions were the story from 2017-2020. The frontier has moved. Vision Transformers, state-space models, and hardware-software co-optimizations are defining the next generation of efficient architectures.

Vision Transformers: Making Attention Efficient

The Vision Transformer (ViT) splits an image into 16×16 patches, projects each to an embedding, and applies transformer layers. The problem: self-attention is O(N²) in sequence length. For a 224×224 image with 16×16 patches, N = 196. Manageable. But for higher resolution or dense prediction, N explodes.

Efficiency tricks:

Window attention (Swin): Compute attention only within local windows of 7×7 patches. Shift windows between layers for cross-window communication. Complexity: O(N × W²) instead of O(N²), where W is the window side length (W² tokens per window).
Token pruning: Remove uninformative tokens mid-network. If a patch is "boring" (low attention from CLS token), drop it. Saves 30-50% compute with <0.5% accuracy loss.
Linear attention: Replace softmax(QK^T)V with φ(Q)(φ(K)^TV). The kernel trick makes attention O(N) instead of O(N²).
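
The gap between O(N²) and O(N × W²) is easy to quantify. This sketch counts only query-key score computations, ignoring the head dimension and the softmax, and assumes the token grid divides evenly into windows; the function names are illustrative.

```python
def global_attention_cost(num_tokens):
    """Every token attends to every token: N^2 scores."""
    return num_tokens ** 2

def window_attention_cost(num_tokens, window):
    """Tokens attend only within their window: (N / W^2) windows of (W^2)^2 scores."""
    tokens_per_window = window * window
    num_windows = num_tokens // tokens_per_window
    return num_windows * tokens_per_window ** 2

tokens = 56 * 56   # e.g. a 56x56 grid of patch tokens
print(global_attention_cost(tokens))          # 9834496
print(window_attention_cost(tokens, 7))       # 153664
print(global_attention_cost(tokens) / window_attention_cost(tokens, 7))   # 64.0
```

The saving factor equals the number of windows, so it grows with resolution while the per-token cost stays fixed at W² scores.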

Swin Transformer efficiency: By restricting attention to 7×7 local windows and shifting them each layer, Swin achieves linear complexity in image size while maintaining global receptive field through shifted windows across layers. It matches EfficientNet accuracy with better scaling to high resolution.

State-Space Models (Mamba)

Mamba (Gu & Dao, 2023) offers an entirely different approach to sequence modeling. Instead of attention (O(N²)) or even linear attention (O(N) but with reduced capacity), Mamba uses a selective state-space model that processes sequences in O(N) time with O(1) memory per step during inference.

h_t = A · h_{t-1} + B · x_t
y_t = C · h_t

The key innovation: B, C, and the discretization step Δ are computed from the input (selective), which also makes the discretized A input-dependent. This lets the model decide what to remember and what to forget, similar to a gated RNN, but with the parallelizable structure of SSMs during training.
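
The recurrence can be sketched in a few lines. The version below uses a diagonal state matrix and fixed parameters, so it is a plain (non-selective) SSM and not Mamba itself; in Mamba the parameters would be recomputed from each x_t. The name `ssm_scan` is illustrative.

```python
def ssm_scan(xs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c . h_t over a sequence.

    a, b, c are lists of length state_dim (diagonal A, column B, row C).
    Only the current state h is kept, so memory per step is O(state_dim).
    """
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys

# Impulse response of a one-dimensional state that halves every step
print(ssm_scan([1.0, 0.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[2.0]))
# [2.0, 1.0, 0.5, 0.25]
```

The loop touches each input once and never looks back at earlier tokens, which is the O(N) time and constant-state behavior described above.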

FlashAttention: Hardware-Aware Algorithm Design

FlashAttention doesn't change the math of attention — it changes how it's computed to exploit the GPU memory hierarchy. Standard attention materializes the N×N attention matrix in GPU HBM (slow global memory). FlashAttention tiles the computation so that it stays in SRAM (fast on-chip memory).

Standard attention memory: O(N²)
FlashAttention memory: O(N) — same exact output!

The speedup: 2-4× faster, 5-20× less memory. This isn't approximation — it's exact attention, just computed more cleverly relative to hardware.
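
The trick that makes tiling exact is the online softmax: keep a running maximum, a running normalizer, and a running weighted sum, and rescale them whenever a new block raises the maximum. The sketch below shows this for one query row with scalar values in plain Python; it illustrates the idea only and is not the FlashAttention kernel. The function names and toy numbers are assumptions.

```python
import math

def attention_row_full(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def attention_row_tiled(scores, values, block_size):
    """Same result, but only one block of scores is looked at per step."""
    running_max = float("-inf")
    running_sum = 0.0   # softmax normalizer so far
    running_out = 0.0   # unnormalized weighted sum so far
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial results to the new maximum (0.0 on the first block)
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in s_blk]
        running_sum = running_sum * correction + sum(weights)
        running_out = running_out * correction + sum(w * v for w, v in zip(weights, v_blk))
        running_max = new_max
    return running_out / running_sum

scores = [0.1, 2.0, -1.5, 3.2, 0.7, 1.1]
values = [1.0, -2.0, 0.5, 4.0, 3.0, -1.0]
full = attention_row_full(scores, values)
tiled = attention_row_tiled(scores, values, block_size=2)
print(abs(full - tiled) < 1e-12)   # True
```

Because the per-row state is three numbers regardless of sequence length, the full row of scores never has to be stored.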

python
# FlashAttention: exact same math, hardware-aware implementation
# Standard (slow, O(N²) memory):
# attn = softmax(Q @ K.T / sqrt(d)) @ V

# FlashAttention (fast, O(N) memory):
# Tiles Q, K, V into blocks that fit in SRAM
# Computes attention block-by-block using online softmax
# Never materializes the full N×N matrix

from flash_attn import flash_attn_func

# q, k, v: CUDA tensors in fp16 or bf16, shaped (batch, seq_len, num_heads, head_dim)
# Same math as standard attention, typically 2-4x faster
output = flash_attn_func(q, k, v, causal=True)

# Memory comparison for sequence length 4096:
# Standard: 4096 × 4096 × 2 bytes = 32 MB per head
# FlashAttention: O(block_size) ≈ 256 KB per head

Efficiency Frontier: Architectures Through Time

Each dot is a model. X-axis: FLOPs. Y-axis: accuracy. The frontier moves up-and-left over time as architectures get more efficient.

FlashAttention achieves 2-4× speedup over standard attention by:

Approximating attention with fewer tokens
Computing exact attention in tiles that fit in fast on-chip SRAM, avoiding slow global memory
Using INT8 quantization for the attention matrix

Chapter 9: Mastery & Connections

You now understand the full stack of efficient architecture design: from the fundamental compute savings of depthwise separable convolutions, through compound scaling and automated search, to hardware-aware deployment and emerging paradigms.

Architecture Selection Flowchart

What's your device?

Determines op palette and precision

↓

What's your latency budget?

<5ms: MobileNet-class. <50ms: EfficientNet. >50ms: ViT/Large

↓

What's your task?

Classification: backbone only. Detection: + FPN + head. Segmentation: + decoder

↓

Optimize

NAS for your device, quantize, fuse ops, profile

FLOP/Parameter/Latency Cheat Sheet

Architecture	Params	FLOPs	Top-1	Mobile Latency
MobileNetV2 1.0	3.4M	300M	72.0%	~6ms
MobileNetV3-Small	2.5M	56M	67.4%	~3ms
EfficientNet-B0	5.3M	390M	77.1%	~12ms
EfficientNet-B3	12M	1.8B	81.6%	~45ms
Swin-Tiny	28M	4.5B	81.3%	~90ms
ConvNeXt-Tiny	28M	4.5B	82.1%	~85ms
ResNet-50	25M	4.1B	76.1%	~70ms

Derivation: Depthwise Separable Savings

Let's formally prove the savings ratio. For a layer with kernel K, C_in input channels, C_out output channels, and H×W spatial:

Standard FLOPs = K² · C_in · C_out · H · W

Separable FLOPs = (K² · C_in + C_in · C_out) · H · W

Ratio = K² · C_in · C_out / (K² · C_in + C_in · C_out)

= K² · C_out / (K² + C_out) = 1 / (1/C_out + 1/K²)

As C_out → ∞, ratio → K². For K=3: maximum savings is 9×. For typical C_out=256: savings = 9×256/(9+256) = 8.7×. For C_out=64: savings = 9×64/(9+64) = 7.9×. The savings improve with more output channels.

Design challenge: Fit an 80%+ ImageNet model under 5ms on mobile. Current approaches: (1) EfficientNet-B0 + INT8 quantization + operator fusion = ~4ms at 77%. (2) Hardware-aware NAS (MnasNet/FBNet) + knowledge distillation from a large teacher = ~5ms at 78%. Getting to 80%+ under 5ms remains an active research frontier requiring both architecture and compiler innovations.

Connections

This lesson connects to many others in the series:

Transformers — the architecture that FlashAttention and window attention make efficient
SSM/Mamba — the O(N) alternative to attention for sequence modeling
Diffusion Models — efficiency here means faster sampling (fewer steps, distillation)
VLAs — robot policies that MUST run efficiently in real-time control loops

Closing thought: "The best architecture is the one that does the most work per joule." Efficiency isn't about limitations — it's about intelligence. A sparrow's brain is a miracle of efficient design: 1 billion neurons achieving navigation, communication, and motor control on 50 milliwatts. Our architectures should aspire to the same.

For a 3×3 depthwise separable conv with 512 output channels, the theoretical FLOP savings over standard conv is approximately:

9 × 512 / (9 + 512) ≈ 8.8×
Exactly 9× regardless of channels
3× (because 3×3 kernel)
