Teach: DL/ML Foundations — Engineermaxxing

What You'll Teach

Each question asks you to explain and draw a foundational concept on the whiteboard. These are the topics that come up in every ML interview, every paper discussion, every architecture decision. If you can teach all 17, you own the foundations.

Optimization & Training

Q1: Gradient Descent Landscape
Draw a 2D loss landscape with a global minimum, local minimum, and saddle point. Walk through how SGD with momentum navigates each.
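A minimal sketch of the momentum update, to have something concrete behind the drawing. The toy loss, the function names `momentum_step` and `toy_grad`, and the values of `lr` and `mu` are all assumptions chosen for illustration: the loss f(x, y) = x² - y² + 0.25·y⁴ has a saddle at (0, 0) and two minima at (0, ±√2).

```python
def momentum_step(w, v, grad, lr=0.05, mu=0.9):
    """One SGD-with-momentum update: v <- mu*v - lr*grad, then w <- w + v."""
    v_new = [mu * vi - lr * gi for vi, gi in zip(v, grad)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


def toy_grad(w):
    """Gradient of the illustrative loss f(x, y) = x**2 - y**2 + 0.25 * y**4."""
    x, y = w
    return [2 * x, -2 * y + y ** 3]


# Start just off the saddle's unstable axis; the tiny y offset is what
# lets the accumulated velocity carry the iterate away from (0, 0).
w, v = [1.0, 0.01], [0.0, 0.0]
for _ in range(500):
    w, v = momentum_step(w, v, toy_grad(w))
```

Started exactly at y = 0, the gradient in y is zero and the iterate stays on the saddle, which is why the noise in real SGD matters.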

Q2: Backpropagation
Draw a 3-layer network, label the forward pass, then trace the backward pass showing how the chain rule computes ∂L/∂w for every weight.
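As a companion to the whiteboard trace, here is a hedged scalar sketch: a chain of three weights with tanh activations and a squared-error loss. The network shape, the loss, and the name `forward_backward` are assumptions for illustration; a real network uses matrices, but the chain-rule bookkeeping is the same.

```python
import math


def forward_backward(x, t, w1, w2, w3):
    """Forward and backward pass for y = w3 * tanh(w2 * tanh(w1 * x))."""
    # Forward pass: keep every intermediate, the backward pass reuses them.
    a1 = math.tanh(w1 * x)
    a2 = math.tanh(w2 * a1)
    y = w3 * a2
    loss = 0.5 * (y - t) ** 2

    # Backward pass: multiply local derivatives from the loss back to w1.
    dy = y - t                     # dL/dy
    dw3 = dy * a2                  # dL/dw3
    da2 = dy * w3                  # dL/da2
    dz2 = da2 * (1 - a2 ** 2)      # through tanh of layer 2
    dw2 = dz2 * a1                 # dL/dw2
    da1 = dz2 * w2                 # dL/da1
    dz1 = da1 * (1 - a1 ** 2)      # through tanh of layer 1
    dw1 = dz1 * x                  # dL/dw1
    return loss, (dw1, dw2, dw3)
```

A finite-difference check on any one weight is a quick way to confirm the hand-derived gradients.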

Q3: Adam vs SGD
Compare SGD, SGD+Momentum, RMSProp, and Adam. Draw what each optimizer "sees" in an elongated loss valley.
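To ground the comparison, a single-parameter sketch of the Adam update with bias correction. The name `adam_step` is illustrative and the default hyperparameters are the commonly quoted ones, assumed here rather than taken from this page.

```python
import math


def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter w with gradient g at step t >= 1."""
    m = b1 * m + (1 - b1) * g          # first moment (momentum-like average)
    v = b2 * v + (1 - b2) * g * g      # second moment (RMSProp-like average)
    m_hat = m / (1 - b1 ** t)          # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Two coordinates of an elongated valley: one steep, one shallow.
steep, _, _ = adam_step(0.0, 100.0, 0.0, 0.0, t=1)
shallow, _, _ = adam_step(0.0, 0.01, 0.0, 0.0, t=1)
```

Both coordinates move by roughly `lr` on the first step despite a 10,000× difference in gradient scale, which is the per-coordinate normalization worth drawing in the valley picture.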

Q4: Learning Rate Schedules
Sketch warmup + cosine decay, step decay, and constant LR. Draw the training loss curves and explain when each schedule wins.
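The schedules can be sketched as plain functions of the step index. The function names, the linear warmup shape, and the default `gamma` are assumptions for illustration; frameworks differ in small details such as whether warmup starts at zero.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


def step_decay_lr(step, base_lr, drop_every, gamma=0.1):
    """Multiply the learning rate by gamma every drop_every steps."""
    return base_lr * gamma ** (step // drop_every)


def constant_lr(step, base_lr):
    """Baseline: the same learning rate at every step."""
    return base_lr
```

Plotting each function over `range(total_steps)` gives the three curves to sketch on the board.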

Loss Functions & Generalization

Q5: Cross-Entropy Loss
Derive cross-entropy from maximum likelihood. Draw the loss surface for a 3-class softmax and show why CE penalizes confident wrong predictions harshly.
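A small numeric sketch helps show the "confident and wrong" penalty. The helper names `softmax` and `cross_entropy` and the example logits are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability that the model assigns to the true class."""
    return -math.log(softmax(logits)[target])


logits = [5.0, 0.0, 0.0]                       # model is confident in class 0
loss_if_right = cross_entropy(logits, 0)       # true class is 0
loss_if_wrong = cross_entropy(logits, 1)       # true class is 1
loss_uniform = cross_entropy([0.0, 0.0, 0.0], 0)
```

Because the loss is -log p(true class), it grows without bound as the probability on the true class goes to zero, while a uniform guess over 3 classes costs only log 3.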

Q6: Bias-Variance Tradeoff
Draw the classic bias-variance decomposition diagram. Show where underfitting and overfitting live, and explain how model complexity moves you along the curve.

Q7: Regularization Arsenal
Compare L1, L2, dropout, and data augmentation. Draw the weight distribution under L1 vs L2, and explain why L1 produces sparsity.
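One way to make the sparsity argument concrete is to compare a single shrinkage step under each penalty. This sketch assumes the L1 penalty is applied with soft-thresholding (a proximal step, one common choice, not the only one) and the L2 penalty as ordinary weight decay; the function names are illustrative.

```python
def l1_shrink(w, lr, lam):
    """Soft-thresholding: pull w toward zero by a fixed amount lr*lam, clamping at zero."""
    mag = max(abs(w) - lr * lam, 0.0)
    return mag if w > 0 else -mag


def l2_shrink(w, lr, lam):
    """Weight decay: scale w by a constant factor, so it shrinks but never reaches zero."""
    return w * (1 - lr * lam)


weights = [1.0, -0.5, 0.05, -0.02]
after_l1 = [l1_shrink(w, lr=0.1, lam=1.0) for w in weights]
after_l2 = [l2_shrink(w, lr=0.1, lam=1.0) for w in weights]
```

The L1 step subtracts a constant regardless of the weight's size, so small weights land exactly on zero; the L2 step is proportional to the weight, so small weights only get smaller.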

Network Components

Q8: Batch Normalization
Draw a mini-batch flowing through a BatchNorm layer. Show the normalize → scale → shift steps, label γ and β, and explain why BN helps training.
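The normalize → scale → shift steps can be sketched for a single feature across a mini-batch. This covers training-mode statistics only; the running averages used at inference are omitted, and the name `batch_norm` and the `eps` default are assumptions.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """BatchNorm for one feature: batch is a list of that feature's values."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n              # biased variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]  # normalize
    return [gamma * xh + beta for xh in x_hat]                  # scale, then shift


out = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=5.0)
```

After the layer, the feature has mean β and standard deviation close to γ, whatever the scale of the incoming activations.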

Q9: Residual Connections
Draw a residual block with the skip connection. Explain the gradient flow advantage and why ResNets remain trainable at 100+ layers while plain networks of the same depth degrade.

Q10: Activation Functions
Draw sigmoid, tanh, ReLU, and GELU. Show the gradient of each, explain why sigmoid causes vanishing gradients and how ReLU mitigates them, and say what "dying ReLU" means.
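The gradient curves can be checked numerically with a few lines. The function names are illustrative, and the GELU derivative below assumes the exact form x·Φ(x) rather than the tanh approximation.

```python
import math


def sigmoid_grad(x):
    s = 1 / (1 + math.exp(-x))
    return s * (1 - s)              # peaks at 0.25 when x = 0


def tanh_grad(x):
    return 1 - math.tanh(x) ** 2    # peaks at 1 when x = 0


def relu_grad(x):
    return 1.0 if x > 0 else 0.0    # exactly 0 for negative inputs ("dead" region)


def gelu_grad(x):
    cdf = 0.5 * (1 + math.erf(x / math.sqrt(2)))            # Phi(x)
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)   # phi(x)
    return cdf + x * pdf


# Best-case gradient factor surviving 10 stacked activations.
sigmoid_chain = sigmoid_grad(0.0) ** 10
relu_chain = relu_grad(1.0) ** 10
```

Even at its peak the sigmoid derivative is 0.25, so ten layers multiply the signal by less than one millionth, while an active ReLU passes it through unchanged.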

Architectures

Q11: Convolution Operation
Draw a 5×5 input, a 3×3 kernel, and the output feature map. Show stride=1 and stride=2 side by side. Calculate the output dimensions and explain parameter sharing.
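The output-size arithmetic and the sliding window can be sketched directly. This assumes a square input, a square kernel, and no padding in `conv2d`; both function names are illustrative. As in most deep learning libraries, the operation shown is cross-correlation (the kernel is not flipped).

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output width for input width n, kernel width k."""
    return (n + 2 * padding - k) // stride + 1


def conv2d(image, kernel, stride=1):
    """Slide one shared kernel over the image; every output cell reuses the same weights."""
    n, k = len(image), len(kernel)
    out = conv_output_size(n, k, stride)
    return [
        [
            sum(image[i * stride + a][j * stride + b] * kernel[a][b]
                for a in range(k) for b in range(k))
            for j in range(out)
        ]
        for i in range(out)
    ]


image = [[1] * 5 for _ in range(5)]    # 5x5 input of ones
kernel = [[1] * 3 for _ in range(3)]   # 3x3 kernel of ones
out_s1 = conv2d(image, kernel, stride=1)
out_s2 = conv2d(image, kernel, stride=2)
```

The 3×3 kernel has 9 weights regardless of the input size, which is the parameter-sharing point: a fully connected map from 25 inputs to 9 outputs would need 225.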

Q12: CNN Architecture
Draw a complete CNN for image classification: input → conv blocks → pooling → flatten → FC → softmax. Label the tensor shapes at each stage for a 224×224×3 input.
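For the shape labels, a short tracer is a handy cross-check. The architecture here is hypothetical and chosen only for illustration: three blocks of 3×3 conv (padding 1) followed by 2×2 max pooling, with 32, 64, and 128 channels, and a 1000-way classifier. None of these choices come from this page.

```python
def conv_out(n, k, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer."""
    return (n + 2 * padding - k) // stride + 1


def trace_shapes(size=224, in_channels=3, block_channels=(32, 64, 128), num_classes=1000):
    """Return (stage name, tensor shape) pairs for a hypothetical CNN."""
    shapes = [("input", (size, size, in_channels))]
    h = size
    for idx, c in enumerate(block_channels, start=1):
        h = conv_out(h, k=3, stride=1, padding=1)   # "same" conv keeps H and W
        shapes.append((f"conv{idx}", (h, h, c)))
        h = conv_out(h, k=2, stride=2)              # pooling halves H and W
        shapes.append((f"pool{idx}", (h, h, c)))
    shapes.append(("flatten", (h * h * block_channels[-1],)))
    shapes.append(("fc+softmax", (num_classes,)))
    return shapes


for name, shape in trace_shapes():
    print(name, shape)
```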

Q13: RNN & Vanishing Gradients
Unroll an RNN for 5 timesteps. Draw the hidden state flow and trace the gradient path from t=5 back to t=0. Show where gradients vanish and how LSTM gates solve this.
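The vanishing product can be seen in a scalar RNN, h_t = tanh(w·h_{t-1} + x_t). This is a simplified sketch with a single recurrent weight; the function name and the input values are assumptions for illustration.

```python
import math


def rnn_gradient_through_time(w, inputs, h0=0.0):
    """Return dh_T/dh_0 for a scalar tanh RNN unrolled over the given inputs."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1 - h * h)   # one factor dh_t/dh_{t-1} per timestep
    return grad


grad_5_steps = rnn_gradient_through_time(0.5, [0.5] * 5)
grad_50_steps = rnn_gradient_through_time(0.5, [0.5] * 50)
```

Each timestep contributes a factor of at most |w|, so with |w| < 1 the gradient shrinks geometrically with sequence length.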

Q14: LSTM Gates
Draw the internal structure of an LSTM cell: forget gate, input gate, cell state, output gate. Trace how information flows through for one timestep and explain what each gate learns to do.
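One timestep of the cell can be sketched with scalar state. The parameter dictionary `p` and its key names are hypothetical, introduced only for this illustration; a real cell uses weight matrices and vector states.

```python
import math


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM timestep with scalar input, hidden state, and cell state."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    c = f * c_prev + i * g        # additive cell-state update
    h = o * math.tanh(c)          # exposed hidden state
    return h, c


# Illustrative setting: forget gate pinned open, input gate pinned shut.
params = {k: 0.0 for k in
          ("wf", "uf", "wi", "ui", "wg", "ug", "bg", "wo", "uo", "bo")}
params["bf"] = 20.0
params["bi"] = -20.0
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, p=params)
```

With the forget gate near 1 and the input gate near 0, the cell state passes through almost unchanged, which is the path that lets gradients survive many timesteps.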

Q15: Self-Attention
Draw the Q, K, V computation for a 4-token sequence. Show the attention matrix, explain the softmax + scaling, and demonstrate why attention is O(n²) in sequence length.
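A 4-token example small enough to check by hand. To keep it short, this sketch assumes Q, K, and V are the token vectors themselves, with the learned projection matrices omitted; the function names and the example vectors are illustrative.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    # n x n score matrix: every query is compared with every key.
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d_k) for kj in K]
              for qi in Q]
    weights = [softmax(row) for row in scores]
    out = [[sum(w * v[c] for w, v in zip(wr, V)) for c in range(len(V[0]))]
           for wr in weights]
    return out, weights


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]   # 4 tokens, d_k = 2
out, weights = self_attention(X, X, X)
```

The `weights` matrix has n × n entries, one per query-key pair, which is where the O(n²) cost in sequence length comes from.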

Synthesis

Q16: CNN → RNN → Transformer
Draw the evolution from CNNs to RNNs to Transformers. For each, show its core operation, what it's good at, and why the next architecture was needed.