Hardware-Aware Deep Learning

Efficient Model Architectures

Same accuracy, 20× less compute. The difference is architecture design.

Prerequisites: Basic neural networks + Convolutions intuition. That's it.
10 Chapters · 10+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Architecture Matters

Two image classifiers. Both achieve 76% accuracy on ImageNet. Model A uses 10 billion FLOPs per image. Model B uses 500 million FLOPs. Same answer, 20× less compute. Model B runs on your phone in real time. Model A needs a data center GPU.

The difference? Architecture design. Not training tricks, not better data, not bigger GPUs. Pure structural decisions about how information flows through the network.

Here's the fundamental equation of computational cost:

Total FLOPs = Σ_layers (operations per position) × (spatial positions) × (channels)

A badly designed layer can waste 90% of its FLOPs on redundant computation — multiplying numbers that contribute almost nothing to the final answer. A well-designed layer computes only what matters.

The core insight: Compute is not free. Every multiply-add costs energy, time, and money. Architecture efficiency means getting the same representational power with fewer operations — not by being approximate, but by being structurally clever.

Let's make this concrete. A standard 3×3 convolution with 256 input channels and 256 output channels on a 56×56 feature map costs:

FLOPs = 3 × 3 × 256 × 256 × 56 × 56 = 1.85 billion

That's almost 2 billion multiply-adds for a single layer. A typical network has 50+ such layers. The question isn't "can we afford this?" — it's "do we need all of it?"
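The arithmetic above is easy to reproduce. The helper below is a minimal sketch (the name `conv_flops` is ours, not a library function) that counts multiply-adds for one standard convolution, assuming stride 1 and "same" padding so the output map has the same size as the input.

```python
def conv_flops(kernel, c_in, c_out, height, width):
    """Multiply-adds for one standard conv layer (stride 1, 'same' padding assumed)."""
    per_position = kernel * kernel * c_in * c_out   # K^2 * C_in * C_out
    return per_position * height * width            # times spatial positions

layer = conv_flops(kernel=3, c_in=256, c_out=256, height=56, width=56)
print(f"{layer:,} multiply-adds")   # 1,849,688,064 multiply-adds
```

Calling it with different channel counts or resolutions shows which factor dominates a given layer.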

FLOP Calculator: Standard vs Efficient

Adjust kernel size, channels, and spatial resolution to see how FLOPs explode for standard convolutions. The teal bar shows depthwise separable cost for comparison.


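Notice how the standard convolution bar grows quadratically with channel count (K² × Cin × Cout, with Cin and Cout scaling together). The separable alternative still has a quadratic pointwise term (Cin × Cout), but that term carries no K² factor, so at K = 3 the total stays roughly 8-9× lower. This gap is the opportunity that efficient architectures exploit.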

Why this matters now: Edge devices (phones, robots, drones, AR glasses) have strict power and latency budgets. You can't just "throw more GPU at it." Architecture efficiency is the only path to real-time inference on constrained hardware.
A standard 3×3 conv layer with C input and C output channels costs K²×C² FLOPs per spatial position. If you double the channel count from 128 to 256, how much do FLOPs increase?

Chapter 1: Standard vs Depthwise Separable

The single most important efficiency trick in deep learning is the depthwise separable convolution. It splits one expensive operation into two cheap ones — and loses almost nothing in accuracy.

Let's understand what a standard convolution actually computes. At each spatial position, a K×K×Cin patch is dot-producted with a filter of the same size, producing one output value. You need Cout such filters to produce all output channels.

Standard convolution cost per spatial position: K² × Cin × Cout multiply-adds. It mixes spatial information AND channel information simultaneously.

The key insight: spatial mixing and channel mixing don't have to happen at the same time. We can separate them.

Step 1: Depthwise Convolution (spatial only)

Apply a separate K×K filter to EACH input channel independently. No cross-channel interaction. Each channel gets its own spatial filter.

Cost per position = K² × Cin

Step 2: Pointwise Convolution (channel only)

Apply 1×1 convolutions to mix channels. This is just a matrix multiply at each spatial position — a linear projection from Cin dimensions to Cout dimensions.

Cost per position = Cin × Cout

Total depthwise separable cost:

K² × Cin + Cin × Cout

Savings ratio:

Standard / Separable = (K² × Cin × Cout) / (K² × Cin + Cin × Cout)
= 1 / (1/Cout + 1/K²) = K² × Cout / (K² + Cout)

For K=3, Cout=256: savings = 9×256 / (9+256) = 2304/265 ≈ 8.7×
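Since Cin cancels out of the ratio, the savings depend only on the kernel size and the number of output channels. The following sketch (the function name `separable_savings` is an illustrative choice) evaluates the formula for a few channel counts.

```python
def separable_savings(kernel, c_out):
    """Ratio of standard to depthwise-separable multiply-adds (C_in cancels)."""
    k2 = kernel * kernel
    return (k2 * c_out) / (k2 + c_out)

for c_out in (32, 64, 256, 1024):
    print(c_out, round(separable_savings(3, c_out), 2))
```

For a 3×3 kernel the ratio climbs toward 9 as Cout grows but never reaches it, because the depthwise step always costs something.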

Worked example: 3×3 conv, 256 → 256 channels, 56×56 spatial.
Standard: 3×3×256×256 × 56×56 = 1,849,688,064 FLOPs
Depthwise: 3×3×256 × 56×56 = 7,225,344 FLOPs
Pointwise: 256×256 × 56×56 = 205,520,896 FLOPs
Total separable: 212,746,240 FLOPs — that's 8.7× cheaper!
python
import torch.nn as nn

# Standard convolution: mixes spatial + channels together
standard = nn.Conv2d(256, 256, kernel_size=3, padding=1)
# Weights: 3*3*256*256 = 589,824 (bias terms not counted)

# Depthwise separable: two steps
depthwise = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=256)
pointwise = nn.Conv2d(256, 256, kernel_size=1)
# Weights: 3*3*256 + 256*256 = 2,304 + 65,536 = 67,840 (bias terms not counted)
# That's 8.7x fewer parameters too!

def depthwise_separable(x):
    x = depthwise(x)   # spatial filtering per channel
    x = pointwise(x)   # channel mixing
    return x
FLOP Breakdown: Where Does Computation Go?

The standard conv is one monolithic block. The separable version splits into two small blocks. Adjust channels to see how the ratio changes.

In a depthwise separable convolution, which step handles cross-channel interaction?

Chapter 2: MobileNet & Inverted Residuals

In 2017, Google published MobileNetV1: stack depthwise separable convolutions, add batch normalization and ReLU after each step, and you get a network with roughly 27× fewer multiply-adds than VGG-16 (about 0.57B versus 15.3B) at similar ImageNet accuracy. Simple and beautiful.

But V1 had a problem. The depthwise step operates on each channel independently — it can't learn cross-channel features at that stage. If you have a narrow bottleneck (few channels), the depthwise filter is starved of information. You need enough channels for the spatial filter to work with.

MobileNetV2: The Inverted Residual Block

The solution is counterintuitive. A standard residual block (ResNet) goes wide→narrow→wide: compress channels with 1×1, do the expensive 3×3 in the narrow space, then expand back. MobileNetV2 does the opposite.

Inverted residual: narrow→wide→narrow. Start with few channels (the bottleneck), EXPAND with 1×1 to give the depthwise filter more channels to work with, apply the 3×3 depthwise in the WIDE space, then PROJECT back down to the narrow bottleneck with 1×1.

Why "inverted"? Because the residual connection links the narrow bottlenecks instead of the wide representations, the reverse of a ResNet block. The data lives in the skinny state; the fat state is temporary, just for spatial filtering.

Input: 24 channels (narrow bottleneck)
Expand 1×1: 24 → 144 channels (t=6)
Depthwise 3×3: 144 channels, spatial filtering
Project 1×1: 144 → 24 channels (compress)
Output + Residual: 24 channels (add to input)

The expansion factor t (typically 6) controls how wide the internal representation gets. With t=6 and a 24-channel bottleneck, the depthwise step works on 144 channels — plenty of capacity for rich spatial features.

Worked Example: FLOPs for One Inverted Residual Block

Input: 24 channels, 56×56 spatial, expansion t=6.

Expand: 24 × 144 × 56² = 10,838,016 FLOPs
Depthwise: 9 × 144 × 56² = 4,064,256 FLOPs
Project: 144 × 24 × 56² = 10,838,016 FLOPs
Total: 25,740,288 FLOPs

Compare to a standard 3×3 conv with 144 input and 144 output channels at 56×56:

Standard: 9 × 144 × 144 × 56² = 585,252,864 FLOPs

The inverted residual is 23× cheaper — while maintaining similar representational capacity.
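The worked example can be checked with a few lines of Python. This is a sketch under the same assumptions as the text (stride 1, multiply-adds only, batch norm and activations ignored); the function name and the dictionary layout are our own.

```python
def inverted_residual_flops(c_in, c_out, height, width, expand_ratio=6, kernel=3):
    """Multiply-adds for expand (1x1) + depthwise (KxK) + project (1x1), stride 1."""
    positions = height * width
    c_mid = c_in * expand_ratio                        # width of the expanded state
    expand = c_in * c_mid * positions
    depthwise = kernel * kernel * c_mid * positions
    project = c_mid * c_out * positions
    return {"expand": expand, "depthwise": depthwise,
            "project": project, "total": expand + depthwise + project}

block = inverted_residual_flops(24, 24, 56, 56)
standard = 9 * 144 * 144 * 56 * 56     # the wide 3x3 conv used for comparison
print(block["total"])                        # 25740288
print(round(standard / block["total"], 1))   # 22.7
```

Note that the two 1×1 convolutions account for most of the block's cost; the depthwise step is the cheapest of the three.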

python
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, c_in, c_out, stride, expand_ratio):
        super().__init__()
        c_mid = c_in * expand_ratio
        self.use_residual = (stride == 1 and c_in == c_out)

        layers = []
        if expand_ratio != 1:
            # Expand: 1x1 conv to widen channels
            layers += [nn.Conv2d(c_in, c_mid, 1),
                       nn.BatchNorm2d(c_mid), nn.ReLU6()]
        # Depthwise: 3x3 spatial filtering
        layers += [nn.Conv2d(c_mid, c_mid, 3, stride,
                             padding=1, groups=c_mid),
                   nn.BatchNorm2d(c_mid), nn.ReLU6()]
        # Project: 1x1 conv to narrow channels (NO activation!)
        layers += [nn.Conv2d(c_mid, c_out, 1),
                   nn.BatchNorm2d(c_out)]
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_residual:
            return x + self.conv(x)
        return self.conv(x)
Critical detail: The projection step has NO activation function (no ReLU). Why? Because ReLU destroys information in low-dimensional spaces — it zeros out negative values. In the narrow bottleneck, every dimension is precious, so we keep it linear. This was a key insight of the MobileNetV2 paper.
Standard vs Inverted Residual Data Flow

Left: standard residual (wide→narrow→wide). Right: inverted residual (narrow→wide→narrow). The standard block's skip connection joins its wide ends; the inverted block's joins its narrow ends.

Why does MobileNetV2's inverted residual expand channels BEFORE the depthwise convolution?

Chapter 3: EfficientNet & Compound Scaling

You have a baseline network that works. Now you want to make it bigger for higher accuracy. You have three knobs to turn: make it deeper (more layers), wider (more channels per layer), or increase resolution (bigger input images). Which do you turn?

Most people pick one. ResNet scales depth (18→34→50→101→152). WideResNet scales width. But the EfficientNet paper (Tan & Le, 2019) discovered something profound: scaling all three together, in a specific ratio, is dramatically better than scaling any one dimension alone.

Compound scaling rule: Given a compute budget multiplier φ, scale all three dimensions simultaneously:
depth: d = α^φ
width: w = β^φ
resolution: r = γ^φ
Subject to: α · β² · γ² ≈ 2

Why those exponents? FLOPs scale as d × w² × r² (depth is linear, width and resolution are quadratic in their effect on compute). The constraint α·β²·γ² ≈ 2 ensures that each unit increase in φ roughly doubles the total FLOPs.

EfficientNet's specific values:

α = 1.2, β = 1.1, γ = 1.15

Check: 1.2 × 1.1² × 1.15² = 1.2 × 1.21 × 1.3225 = 1.92 ≈ 2 ✓
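Because the product is 1.92 rather than exactly 2, the predicted FLOP growth falls slightly behind 2^φ as φ increases. The sketch below makes that visible; it only assumes the proportionality FLOPs ∝ d × w² × r² stated above, and the function name is ours.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution coefficients

def flop_multiplier(phi):
    """Predicted FLOP growth, using FLOPs proportional to d * w^2 * r^2."""
    d, w, r = ALPHA ** phi, BETA ** phi, GAMMA ** phi
    return d * w ** 2 * r ** 2

for phi in range(4):
    # columns: phi, multiplier from the coefficients, nominal 2**phi
    print(phi, round(flop_multiplier(phi), 2), 2 ** phi)
```

At φ = 3 the coefficients give a multiplier of about 7.1 against the nominal 8, so "doubles per step" should be read as an approximation.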

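Worked Example: Scaling B0 to B3

Simplified EfficientNet-B0 baseline: 18 layers, 32 base width, 224×224 input.

Applying the rule with φ = 3 and rounding each result up:

depth: d = 1.2³ = 1.728 → 18 × 1.728 = 31.1 → 32 layers
width: w = 1.1³ = 1.331 → 32 × 1.331 = 42.6 → 43 base channels
resolution: r = 1.15³ = 1.521 → 224 × 1.521 = 340.7 → 341×341 input
FLOP increase: nominally 2³ = 8× over B0 (1.92³ ≈ 7.1× with the exact coefficients)

The released B3 departs from these idealized numbers (it uses a 300×300 input, for instance), which is why its measured cost is about 4.6× B0 rather than 8×.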

The result: EfficientNet-B3 achieves 81.6% ImageNet accuracy with only 1.8B FLOPs — compared to ResNet-152 at 78.3% accuracy with 11.6B FLOPs. Higher accuracy, 6.4× less compute.

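| Model | Top-1 Acc | FLOPs | φ |
| --- | --- | --- | --- |
| EfficientNet-B0 | 77.1% | 0.39B | 0 |
| EfficientNet-B1 | 79.1% | 0.70B | 1 |
| EfficientNet-B2 | 80.1% | 1.0B | 2 |
| EfficientNet-B3 | 81.6% | 1.8B | 3 |
| EfficientNet-B4 | 82.9% | 4.2B | 4 |
| EfficientNet-B7 | 84.3% | 37B | 7 |
| ResNet-152 | 78.3% | 11.6B | n/a |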
3D Compound Scaling Visualizer

Drag the φ slider to scale the network. Watch depth, width, and resolution grow together. The cube's volume represents total FLOPs.

python
import math

# EfficientNet compound scaling
alpha = 1.2   # depth coefficient
beta  = 1.1   # width coefficient
gamma = 1.15  # resolution coefficient

def scale_model(phi, base_depth=18, base_width=32, base_res=224):
    d = math.ceil(base_depth * alpha**phi)
    w = math.ceil(base_width * beta**phi)
    r = math.ceil(base_res * gamma**phi)
    flop_mult = 2**phi  # approximately
    return {'depth': d, 'width': w, 'resolution': r,
            'flop_mult': flop_mult}

# EfficientNet-B3
b3 = scale_model(phi=3)
print(b3)  # {'depth': 32, 'width': 43, 'resolution': 341, 'flop_mult': 8}
Why does the compound scaling constraint use β² and γ² but only α¹?

Chapter 4: Neural Architecture Search

What if instead of a human designing the architecture, we let an algorithm search for one? That's Neural Architecture Search (NAS) — automated discovery of optimal network structures.

The setup: define a search space (what operations are allowed? how many layers? which connections?), a search strategy (how do we explore?), and an evaluation method (how do we score each candidate?).

NAS as optimization: Think of it as a meta-learning problem. The "model" being trained is the architecture itself. The "objective" is validation accuracy, often traded off against latency or FLOPs. The "optimizer" is whatever search algorithm we use: RL, evolutionary, or gradient-based.

Search Space

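A typical cell-based search space (NASNet) defines which operations a cell may use and how its nodes connect; the search picks a cell design, which is then stacked repeatedly to form the network.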

The Cost Problem

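Naive NAS (Zoph & Le, 2017): train each candidate architecture to convergence, then repeat for thousands of candidates. Even at an optimistic 1 GPU-hour per evaluation that is thousands of GPU-hours, and the original experiments reportedly ran on hundreds of GPUs for weeks. That is impractical for most teams.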

One-Shot NAS (Weight Sharing)

The breakthrough: train a single supernet that contains ALL possible architectures as sub-networks. Each candidate is a subset of the supernet's weights. To evaluate a candidate, just mask the supernet to that subset — no retraining needed.

1. Define Search Space: operations, connections, cell topology
2. Train Supernet: all paths get weight updates (stochastic sampling)
3. Search: evaluate subnets using shared weights, no retraining
4. Retrain Winner: train the found architecture from scratch for final accuracy
python
import torch.nn as nn

# Simplified one-shot NAS supernet
class SupernetCell(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),     # op 0: 3x3 conv
            nn.Conv2d(channels, channels, 5, padding=2),     # op 1: 5x5 conv
            nn.Conv2d(channels, channels, 3, padding=1,      # op 2: depthwise
                      groups=channels),
            nn.MaxPool2d(3, stride=1, padding=1),          # op 3: max pool
            nn.Identity(),                                    # op 4: skip
        ])

    def forward(self, x, arch_choice):
        # arch_choice selects which operation to use
        return self.ops[arch_choice](x)

# During search: sample arch_choice randomly per step
# After search: fix arch_choice to best found architecture
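Modern NAS efficiency: Hardware-aware NAS (like MnasNet, FBNet) adds a latency term to the search objective. MnasNet, for example, uses reward = accuracy × (latency/target)^w with a small negative exponent w, so candidates slower than the target are penalized. This steers the search toward architectures that are both accurate AND fast on the target device.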
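A small sketch shows how a latency-weighted reward reorders candidates. It assumes the MnasNet-style form accuracy × (latency/target)^w with a negative exponent; the default w = -0.07 is the value we recall from the MnasNet paper and should be treated as an assumption, and the candidate numbers are made up for illustration.

```python
def latency_aware_reward(accuracy, latency_ms, target_ms, w=-0.07):
    """accuracy * (latency / target) ** w; with w < 0, slower models score lower."""
    return accuracy * (latency_ms / target_ms) ** w

candidates = {                 # hypothetical (accuracy, latency in ms) pairs
    "small": (0.72, 40.0),
    "medium": (0.75, 80.0),
    "large": (0.76, 160.0),
}
target = 80.0
for name, (acc, lat) in candidates.items():
    print(name, round(latency_aware_reward(acc, lat, target), 4))
```

With these numbers the small model scores highest even though it is the least accurate: running at half the target latency outweighs its three-point accuracy deficit. A less negative w would shift the balance back toward accuracy.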
NAS Search Animation

Watch architectures being proposed, evaluated, and selected. Green = high accuracy, red = low. The population evolves toward better designs.

What is the main advantage of one-shot NAS over naive NAS?

Chapter 5: Mobile Deployment

You've designed an efficient architecture. You've trained it in PyTorch or TensorFlow on a beefy GPU server. Now you need to run it on a phone, a drone, or an embedded sensor. The gap between "works in my notebook" and "runs at 30fps on an iPhone" is enormous.

The problem: different hardware requires different representations. Your GPU training framework stores the model as a Python object graph with floating-point weights. A mobile chip needs a compact binary with quantized weights and fused operations.

The deployment pipeline: Train (PyTorch/TF) → Export (ONNX) → Optimize (graph fusion, quantization) → Compile (target-specific) → Deploy (device runtime).

Export Formats

| Format | Target | Key Feature |
| --- | --- | --- |
| ONNX | Interchange | Framework-agnostic graph format |
| TFLite | Android/embedded | Quantization-aware, small binary |
| CoreML | Apple devices | Neural Engine acceleration |
| TensorRT | NVIDIA GPUs | Kernel fusion, FP16/INT8 |
| NNAPI | Android NPUs | Hardware abstraction layer |

Graph Optimizations

Before deploying, the compiler performs graph transformations, operator fusion chief among them, that leave the math unchanged but dramatically speed up execution.
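A widely used transformation of this kind is folding an inference-time batch normalization into the preceding convolution, so two operations become one. The sketch below demonstrates the idea on a 1×1 convolution at a single spatial position, written in plain Python; the helper names and toy numbers are illustrative assumptions, not any compiler's actual implementation.

```python
import math

def linear(x, weight, bias):
    """A 1x1 conv at one spatial position: y[o] = sum_i weight[o][i] * x[i] + bias[o]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch norm using stored running statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]

def fold_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Bake the per-channel BN scale and shift into the conv weights and bias."""
    scales = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    new_w = [[w * k for w in row] for row, k in zip(weight, scales)]
    new_b = [(b - m) * k + be for b, m, k, be in zip(bias, mean, scales, beta)]
    return new_w, new_b

# Toy numbers, chosen only for illustration
weight, bias = [[0.5, -1.0], [2.0, 0.25]], [0.1, -0.2]
gamma, beta, mean, var = [1.5, 0.8], [0.0, 0.3], [0.2, -0.1], [4.0, 0.25]
x = [1.0, 2.0]

two_ops = batchnorm(linear(x, weight, bias), gamma, beta, mean, var)
fused_w, fused_b = fold_bn(weight, bias, gamma, beta, mean, var)
one_op = linear(x, fused_w, fused_b)
print(two_ops, one_op)   # the two lists agree up to floating-point rounding
```

The fused layer produces the same output with one pass over the data instead of two, which matters most for memory-bound layers.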

Quantization

The biggest single deployment optimization. Convert FP32 weights and activations to INT8:

x_int8 = round(x_fp32 / scale) + zero_point

This gives 4× memory reduction, 2-4× speedup on INT8-capable hardware, and typically less than 1% accuracy loss with proper calibration.
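To make the formula concrete, here is a minimal pure-Python sketch of affine quantization and its approximate inverse. The helper names, the simple min/max calibration, and the sample weights are illustrative assumptions; production toolchains calibrate ranges from representative data and often quantize per channel.

```python
def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Affine INT8 quantization: round(x / scale) + zero_point, clamped to range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    """Approximate inverse: (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in q_values]

weights = [-1.0, -0.25, 0.0, 0.4, 1.0]
scale = (max(weights) - min(weights)) / 255       # spread the observed range over 256 levels
zero_point = round(-128 - min(weights) / scale)   # places min(weights) near -128

q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
print(q)
print([round(r, 3) for r in restored])
```

Each restored value lands within one quantization step (`scale`) of the original, which is the rounding error that calibration tries to keep small relative to the weight distribution.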

python
import torch
import onnx
import onnxruntime as ort

# Step 1: Export PyTorch model to ONNX (`model` is your trained nn.Module, defined elsewhere)
model.eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["image"],
                  output_names=["logits"],
                  dynamic_axes={"image": {0: "batch"}})

# Step 2: Optimize with ONNX Runtime
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options)

# Step 3: Quantize to INT8
from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic("model.onnx", "model_int8.onnx",
                 weight_type=QuantType.QInt8)
Operator support gaps: Not every PyTorch operation has a TFLite equivalent. Custom activations (Swish, Mish), dynamic shapes, and complex control flow often need workarounds. Always check the operator compatibility table before designing your architecture for mobile.
Deployment Pipeline Flowchart

The journey from training to inference. Green stages are hardware-agnostic; orange stages are target-specific.

What does "operator fusion" do during model compilation?

Chapter 6: Hardware Acceleration

Architecture efficiency doesn't exist in a vacuum — it exists relative to hardware. An operation that's "efficient" in FLOPs might be slow if the hardware can't execute it well. Understanding hardware is how you design architectures that are fast in practice, not just in theory.

The hardware truth: FLOPs ≠ latency. A 4×4 matrix multiply is essentially free on a Tensor Core (done in one clock cycle). A weird element-wise operation with data-dependent branching takes hundreds of cycles. The hardware dictates which operations are fast.

Hardware Taxonomy

| Hardware | Strength | Peak TOPS | Good At |
| --- | --- | --- | --- |
| CPU | Flexibility | ~1 (INT8) | Control flow, small batches |
| GPU (CUDA) | Parallelism | ~312 (A100) | Large matrix multiply, batch inference |
| Tensor Core | Matrix ops | ~624 (FP16) | 4x4 matmul blocks, transformer layers |
| TPU | Systolic arrays | ~275 (v4) | Dense matmul, training at scale |
| Mobile NPU | Efficiency | ~15 (A17) | Depthwise conv, quantized inference |

The Roofline Model

Performance is limited by either compute (how fast the hardware can multiply) or memory bandwidth (how fast data can be moved). The arithmetic intensity (FLOPs per byte loaded) determines which bottleneck dominates.

Arithmetic Intensity = FLOPs / Bytes Accessed
Attainable Performance = min(Peak FLOPs, Bandwidth × Arithmetic Intensity)

Dense matrix multiply has high arithmetic intensity (~O(N) FLOPs per byte) — it's compute-bound on modern hardware. Depthwise convolution has low arithmetic intensity (only K² FLOPs per element loaded) — it's often memory-bound. This is why depthwise conv is "efficient" in FLOPs but not always fast in practice.
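The roofline formula is short enough to evaluate directly. The sketch below uses illustrative, roughly A100-class numbers (312 TFLOP/s peak and 1.5 TB/s memory bandwidth are assumptions chosen for the example, not measured values), and the intensities for the two operations are likewise illustrative.

```python
def attainable_tflops(intensity, peak_tflops, bandwidth_tbps):
    """Roofline: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_tflops, bandwidth_tbps * intensity)

# Assumed hardware numbers for illustration only
PEAK, BANDWIDTH = 312.0, 1.5        # TFLOP/s and TB/s
ridge = PEAK / BANDWIDTH            # intensity (FLOPs/byte) where the two limits meet

for name, intensity in [("depthwise 3x3", 9 / 8), ("large matmul", 500.0)]:
    perf = attainable_tflops(intensity, PEAK, BANDWIDTH)
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{name}: {perf:.1f} TFLOP/s attainable, {bound}")
```

Under these assumptions the depthwise layer can reach well under 1% of peak compute, which is the gap between "cheap in FLOPs" and "fast in practice".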

Design implication: On GPU/TPU, prefer large matrix multiplies (high arithmetic intensity). On mobile NPUs, depthwise separable is fine because the hardware has enough bandwidth relative to compute. Architecture efficiency is hardware-relative.
python
# Arithmetic intensity examples

# Dense matmul: C = A @ B, A is MxK, B is KxN
# FLOPs: 2*M*K*N
# Bytes: (M*K + K*N + M*N) * bytes_per_elem
# Intensity: 2*M*K*N / ((M*K + K*N + M*N) * 4) ≈ N/6 for large square matrices (FP32)

# Depthwise 3x3 conv: C channels, HxW spatial
# FLOPs: 9 * C * H * W  (counted as multiply-adds)
# Bytes: (C*H*W input + 9*C weights + C*H*W output) * 4
# Intensity: 9*C*H*W / ((2*C*H*W + 9*C) * 4) ≈ 9/8 ≈ 1.1
# Very low! Memory-bound on most hardware.

def arithmetic_intensity(op_type, **kwargs):
    """FLOPs per byte moved, assuming FP32 (4 bytes) and no cache reuse."""
    if op_type == "matmul":
        M, K, N = kwargs['M'], kwargs['K'], kwargs['N']
        flops = 2 * M * K * N
        bytes_moved = (M*K + K*N + M*N) * 4
    elif op_type == "depthwise":
        C, H, W, K = kwargs['C'], kwargs['H'], kwargs['W'], kwargs['K']
        flops = K*K * C * H * W
        bytes_moved = (2*C*H*W + K*K*C) * 4
    else:
        raise ValueError(f"unknown op_type: {op_type}")
    return flops / bytes_moved
Hardware Roofline Comparison

Each hardware platform has a different roofline. Operations on the sloped part, left of the ridge point, are memory-bound; those under the flat roof, to its right, are compute-bound. Hover to see where common layers fall.

A depthwise 3×3 convolution has low arithmetic intensity (~1 FLOP/byte). On a high-compute GPU, this means the operation is limited by:

Chapter 7: Architecture × Hardware Co-Design (Showcase)

This is the capstone. Everything comes together: depthwise separable blocks, compound scaling, hardware awareness, and deployment constraints. You're the architect now. Your job: design a network that meets a specific latency budget on a specific device while maximizing accuracy.

The design challenge: Select a target device, a task, and a latency budget. The system will recommend an architecture configuration. You can override any choice and watch the predicted latency, memory, and accuracy change in real time.

The key insight of hardware-aware design: there's no single "best" architecture. The best architecture depends on the target device, the task, and the latency budget.

Design Heuristics by Hardware

| Device | Preferred Blocks | Precision | Max Channels |
| --- | --- | --- | --- |
| A100 GPU | Standard conv, attention | FP16/TF32 | 2048+ |
| iPhone NPU | Depthwise sep, inverted res | INT8/FP16 | 512 |
| Raspberry Pi | Thin depthwise, no attention | INT8 | 128 |
Architecture Co-Design Workbench

Select a device, task, and latency budget. Adjust architecture parameters and watch predicted performance metrics change. Design a model that fits your constraints!

Real-world NAS results: MnasNet used this exact approach — latency-constrained NAS targeting a Pixel phone. It found architectures that are 1.8× faster than MobileNetV2 at the same accuracy. The architecture is slightly "weird" (mix of 3×3 and 5×5, varying expansion ratios per stage) but optimal for that specific hardware.

Chapter 8: Emerging Architectures

Efficient convolutions were the story from 2017-2020. The frontier has moved. Vision Transformers, state-space models, and hardware-software co-optimizations are defining the next generation of efficient architectures.

Vision Transformers: Making Attention Efficient

The Vision Transformer (ViT) splits an image into 16×16 patches, projects each to an embedding, and applies transformer layers. The problem: self-attention is O(N²) in sequence length. For a 224×224 image with 16×16 patches, N = 196. Manageable. But for higher resolution or dense prediction, N explodes.
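How quickly N "explodes" is easy to quantify. This sketch counts patch tokens and attention-matrix entries for a square image; it ignores the class token and the embedding dimension, and the function names are our own.

```python
def vit_tokens(image_size, patch_size=16):
    """Number of patch tokens for a square image."""
    return (image_size // patch_size) ** 2

def attention_pairs(num_tokens):
    """Entries in the N x N attention matrix, per head and per layer."""
    return num_tokens ** 2

for size in (224, 448, 896):
    n = vit_tokens(size)
    print(size, n, attention_pairs(n))
```

Doubling the image side quadruples the token count and multiplies the attention matrix by 16, so dense prediction at high resolution quickly becomes the dominant cost.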

Efficiency tricks:

Swin Transformer efficiency: By restricting attention to 7×7 local windows and shifting them each layer, Swin achieves linear complexity in image size while maintaining global receptive field through shifted windows across layers. It matches EfficientNet accuracy with better scaling to high resolution.
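The linear-versus-quadratic difference can be seen by counting scored token pairs. The sketch assumes the feature map side is a multiple of the window size and ignores the shifting step, which changes which tokens share a window but not how many pairs are scored; names are illustrative.

```python
def global_pairs(num_tokens):
    """Token pairs scored by full self-attention."""
    return num_tokens ** 2

def windowed_pairs(num_tokens, window=7):
    """Token pairs when each token attends only within its window x window block."""
    return num_tokens * window * window

for side in (56, 112, 224):          # tokens per side of the feature map
    n = side * side
    print(side, global_pairs(n) // windowed_pairs(n))   # 64, 256, 1024
```

The windowed count grows in proportion to the number of tokens, so the advantage over global attention widens as resolution increases.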

State-Space Models (Mamba)

Mamba (Gu & Dao, 2023) offers an entirely different approach to sequence modeling. Instead of attention (O(N²)) or even linear attention (O(N) but with reduced capacity), Mamba uses a selective state-space model that processes sequences in O(N) time with O(1) memory per step during inference.

h_t = A · h_{t-1} + B · x_t
y_t = C · h_t

The key innovation: B, C, and the discretization step Δ (which controls how strongly A decays the state) are input-dependent (selective), allowing the model to decide what to remember and what to forget. This is similar to a gated RNN, but with the parallelizable structure of SSMs during training.
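The recurrence itself is only a few lines. The sketch below is a deliberately simplified, non-selective version: it uses fixed parameters and a diagonal state matrix, so it illustrates the O(N) scan and constant-size state but not Mamba's input-dependent parameters or its parallel training algorithm. All names are illustrative.

```python
def ssm_scan(xs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t, y_t = c . h_t with a diagonal state matrix.

    a, b, c are lists of equal length (one entry per state dimension).
    """
    h = [0.0] * len(a)                 # fixed-size state, independent of sequence length
    ys = []
    for x in xs:
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys

# Impulse input: the output traces how each state dimension decays over time
ys = ssm_scan(xs=[1.0, 0.0, 0.0, 0.0], a=[0.5, 0.9], b=[1.0, 1.0], c=[1.0, 1.0])
print(ys)
```

A selective model would recompute the update coefficients from each `x` inside the loop, letting the input decide how fast the state forgets.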

FlashAttention: Hardware-Aware Algorithm Design

FlashAttention doesn't change the math of attention — it changes how it's computed to exploit the GPU memory hierarchy. Standard attention materializes the N×N attention matrix in GPU HBM (slow global memory). FlashAttention tiles the computation so that it stays in SRAM (fast on-chip memory).

Standard attention memory: O(N²)
FlashAttention memory: O(N) — same exact output!

The speedup: 2-4× faster, 5-20× less memory. This isn't approximation — it's exact attention, just computed more cleverly relative to hardware.
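The "exact, not approximate" claim rests on the online softmax trick: a running maximum and running sum let partial results be rescaled as new blocks arrive. The pure-Python sketch below shows it for a single query row; it omits the 1/sqrt(d) scaling and everything GPU-specific, and the function names and toy data are illustrative.

```python
import math

def attention_row(q, keys, values):
    """Reference: softmax(q . k) weighted sum of values, all scores at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def attention_row_tiled(q, keys, values, block=2):
    """Same result, visiting keys/values block by block with a running max and sum."""
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk, v_blk = keys[start:start + block], values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)   # corrects earlier partial sums
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.3, -1.2]
keys = [[1.0, 0.5], [-0.7, 2.0], [0.2, 0.2], [1.5, -1.0], [0.0, 0.9]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, -0.5]]
full = attention_row(q, keys, values)
tiled = attention_row_tiled(q, keys, values)
print(full, tiled)   # equal up to floating-point rounding
```

The tiled version never holds more than one block of scores at a time, which is what lets the real kernel keep its working set in on-chip memory.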

python
# FlashAttention: exact same math, hardware-aware implementation
# Standard (slow, O(N²) memory):
# attn = softmax(Q @ K.T / sqrt(d)) @ V

# FlashAttention (fast, O(N) memory):
# Tiles Q, K, V into blocks that fit in SRAM
# Computes attention block-by-block using online softmax
# Never materializes the full N×N matrix

from flash_attn import flash_attn_func

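# Same math as above, 2-4x faster. Expects fp16/bf16 CUDA tensors shaped
# (batch, seq_len, num_heads, head_dim); causal=True additionally applies a causal mask.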
output = flash_attn_func(q, k, v, causal=True)

# Memory comparison for sequence length 4096:
# Standard: 4096 × 4096 × 2 bytes = 32 MB per head
# FlashAttention: O(block_size) ≈ 256 KB per head
Efficiency Frontier: Architectures Through Time

Each dot is a model. X-axis: FLOPs. Y-axis: accuracy. The frontier moves up-and-left over time as architectures get more efficient.

FlashAttention achieves 2-4× speedup over standard attention by:

Chapter 9: Mastery & Connections

You now understand the full stack of efficient architecture design: from the fundamental compute savings of depthwise separable convolutions, through compound scaling and automated search, to hardware-aware deployment and emerging paradigms.

Architecture Selection Flowchart

1. What's your device? Determines op palette and precision.
2. What's your latency budget? <5ms: MobileNet-class. <50ms: EfficientNet. >50ms: ViT/Large.
3. What's your task? Classification: backbone only. Detection: + FPN + head. Segmentation: + decoder.
4. Optimize: NAS for your device, quantize, fuse ops, profile.

FLOP/Parameter/Latency Cheat Sheet

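| Architecture | Params | FLOPs | Top-1 | Mobile Latency |
| --- | --- | --- | --- | --- |
| MobileNetV2 1.0 | 3.4M | 300M | 72.0% | ~6ms |
| MobileNetV3-Small | 2.5M | 56M | 67.4% | ~3ms |
| EfficientNet-B0 | 5.3M | 390M | 77.1% | ~12ms |
| EfficientNet-B3 | 12M | 1.8B | 81.6% | ~45ms |
| Swin-Tiny | 28M | 4.5B | 81.3% | ~90ms |
| ConvNeXt-Tiny | 28M | 4.5B | 82.1% | ~85ms |
| ResNet-50 | 25M | 4.1B | 76.1% | ~70ms |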

Derivation: Depthwise Separable Savings

Let's formally prove the savings ratio. For a layer with kernel K, Cin input channels, Cout output channels, and H×W spatial:

Standard FLOPs = K² · Cin · Cout · H · W
Separable FLOPs = (K² · Cin + Cin · Cout) · H · W
Ratio = K² · Cin · Cout / (K² · Cin + Cin · Cout)
= K² · Cout / (K² + Cout) = 1 / (1/Cout + 1/K²)

As Cout → ∞, ratio → K². For K=3: maximum savings is 9×. For typical Cout=256: savings = 9×256/(9+256) = 8.7×. For Cout=64: savings = 9×64/(9+64) = 7.9×. The savings improve with more output channels.

Design challenge: Fit an 80%+ ImageNet model under 5ms on mobile. Current approaches: (1) EfficientNet-B0 + INT8 quantization + operator fusion = ~4ms at 77%. (2) Hardware-aware NAS (MnasNet/FBNet) + knowledge distillation from a large teacher = ~5ms at 78%. Getting to 80%+ under 5ms remains an active research frontier requiring both architecture and compiler innovations.

Connections

This lesson connects to many others in the series:

Closing thought: "The best architecture is the one that does the most work per joule." Efficiency isn't about limitations — it's about intelligence. A sparrow's brain is a miracle of efficient design: 1 billion neurons achieving navigation, communication, and motor control on 50 milliwatts. Our architectures should aspire to the same.
For a 3×3 depthwise separable conv with 512 output channels, the theoretical FLOP savings over standard conv is approximately: