Build intuition from absolute zero. Every concept from first principles, with interactive simulations, step-by-step math, and quizzes. No prerequisites beyond curiosity.
From absolute zero to understanding every line of Karpathy's 243-line GPT.
Self-attention, multi-head, KV cache, MoE — the architecture behind everything.
Split an image into patches, treat them as tokens, run a Transformer. How ViT unified vision and language architectures.
Why one design rules everything — retrofitting patterns, cross-attention, conditioning zoo, composition.
Linear recurrence, selective scan — the O(n) alternative to attention.
Add noise, learn to reverse it. The dominant generative paradigm.
Straight paths from noise to data — simpler, faster than diffusion.
Replace the U-Net with a Transformer. adaLN-Zero conditioning, scaling laws, and how DiT powers SD3, FLUX, and Sora.
The secret plumbing — variational inference, codebooks, tokenization.
Generator vs discriminator — the adversarial game.
The glue connecting images and text in a shared space.
Teaching language models to see — vision encoder + LLM fusion.
Foundation models that physically act in the world.
Learning to imagine before acting — prediction as intelligence.
Reconstructing 3D worlds from 2D photographs.
RLHF, DPO, Constitutional AI — making AI do what we want.
From vibe checks to rigorous testing. Graders, pass@K/pass^K, τ-bench, Terminal-Bench, the Swiss cheese model.
Scaling without paying for it: experts & routing, top-k gating, the collapse problem & load-balancing loss, capacity/token-dropping, Switch, Mixtral & DeepSeek, expert parallelism.
Letting the model think longer: self-consistency, best-of-N & verifiers, chain-of-thought as compute, tree search, o1-style reasoning RL, the train-vs-test tradeoff, and the overthinking trap.
Seeing every pixel: encoder-decoder, the bottleneck blur, skip connections (the one big idea), upsampling & checkerboards, Dice loss, and why it became the backbone of diffusion models.
Escaping the quadratic wall: where the n² comes from, the kernel/associativity trick, the recurrent dual form, the recall catch, RWKV’s decay & gating, the Mamba/RetNet family, and hybrid models.
The LSTM reborn: the two fatal flaws, saturating sigmoids, exponential gating + normalizer (sLSTM), matrix memory (mLSTM), parallelizability, the revision lab, the block, and the recurrent-revival family.
The impossible triangle: retention (attention minus softmax plus decay), and one mechanism computed three equivalent ways — parallel to train, recurrent to deploy, chunkwise to scale. Multi-scale heads & the family.
Attention by convolution: a sequence-length filter, made practical by implicit filters (any length, few params), the FFT (n log n), and data-controlled gating — plus the recurrence=convolution duality.
The hybrid that works: interleave mostly-Mamba layers with a few attention layers (for recall) plus MoE (for cheap capacity) — why a few attention layers suffice, the KV-cache memory win, and the hybrid era.
Google DeepMind’s RNN-speed transformers: a gated linear recurrence (RG-LRU) for cheap global memory + local attention for sharp recall. Linear recurrence → parallel scan → stability → gating → the hybrid stack → constant-size inference.
Yann LeCun’s Joint Embedding Predictive Architecture: stop predicting pixels, predict meaning. Why pixel loss blurs → latent prediction → collapse & the EMA fix → predictor + mask tokens → masking → the full I-JEPA loop → world models.
One architecture for any modality: funnel a huge input through a small latent bottleneck with cross-attention. The n² wall → latent workspace → cross-attention → latent thinking → Fourier position → iteration → arbitrary outputs via query arrays.
Networks with infinite, continuous depth. A residual block is one Euler step — take the limit and depth becomes a flow field solved by an ODE solver: adaptive computation, O(1)-memory adjoint gradients, and the continuous-normalizing-flow ancestor of diffusion.
Kolmogorov–Arnold Networks: flip the MLP — put learnable 1-D functions (splines) on the edges, let nodes just sum. The result is a network you can read, prune, refine, and turn into a symbolic formula.
MIT’s tiny, robust brains: continuous-time neurons whose response speed adapts to the input. Leaky neuron → liquid time constant → bounded stability → robustness → neural circuit policies → closed-form CfC. 19 neurons can drive a car.
How machines learned to hear: turn sound into a log-mel picture, run a vanilla transformer encoder-decoder over it, control the task with prompt tokens, and train on 680,000 hours of the messy internet. The audio front-end behind many modern speech models.
Sora-style spacetime diffusion: compress video into a spacetime latent, chop it into patches spanning space and time, and denoise the whole clip with a diffusion transformer whose attention crosses frames — coherence by construction, plus emergent world-model behavior.
The denoising that paints images can drive a robot. Instead of regressing one action (and averaging multimodal demos into a crash), it generates an action chunk from noise — committing to a mode, conditioned on observations, replanned in a closed loop. The action head behind modern VLAs.
The journey from a raw pressure wave to the log-mel spectrogram every audio model eats: sampling & Nyquist, the Fourier transform, the STFT, the spectrogram, and the mel/log perceptual scales — plus MFCCs and learned front-ends.
How EnCodec/SoundStream turn sound into discrete tokens a language model can generate: conv autoencoder, vector quantization, residual codebooks (coarse→fine, scalable bitrate), and the GAN that keeps it crisp. The bridge to AudioLM, MusicGen, VALL-E.
From robotic to human: text frontend (G2P), acoustic model, vocoder, the alignment problem, and the one-to-many trick — through Tacotron, FastSpeech, VITS, and modern diffusion & codec-token TTS, plus voice cloning.
BERT for audio: Wav2Vec 2.0 & HuBERT learn speech from unlabeled sound by masking spans and predicting discrete speech units — so ten minutes of transcripts can train a recognizer. Pretrain+fine-tune, contrastive vs cluster-predict.
From text prompt to song: tokenize audio, model long-range structure with semantic + acoustic tokens, the AudioLM hierarchy and MusicGen delay pattern, joint text-music embeddings, and diffusion approaches (Stable Audio). The capstone of the Audio track.
Deep learning on unordered 3D point sets: permutation invariance via max pooling (PointNet), hierarchical local features (PointNet++), point transformers, and ICP registration. Where convolutions can’t go.
Learning on nodes and edges via message passing: each node gathers neighbor messages, aggregates symmetrically, and updates. GCN, GAT, over-smoothing, and the unifying view (a transformer is a GNN on a full graph).
Predicting the future with its uncertainty: lookback→horizon, trend/seasonality decomposition, probabilistic (quantile) forecasts, global models, N-BEATS, the Temporal Fusion Transformer, PatchTST, and zero-shot foundation models.
Fit a distribution over functions, not one curve — getting calibrated uncertainty for free. Kernels, conditioning, error bars, the n³ wall, and the broader uncertainty toolkit (deep ensembles, MC dropout, conformal prediction).
The engines behind Netflix/Amazon/YouTube: matrix factorization, user & item embeddings, the two-tower retrieval model, the retrieve-then-rank funnel, DLRM, and hard problems (cold start, feedback loops). Personalization at billion-item scale.
How text becomes the integers a model sees: the word-vs-character trade-off, byte-pair encoding built from scratch, encoding by merge-replay, WordPiece & Unigram, byte-level BPE (no OOV ever), and SentencePiece’s ▁ spaces. Why models can’t count the r’s in “strawberry.”
The model outputs a distribution, not a word — decoding turns it into text. Logits & softmax, greedy’s repetition trap, temperature, top-k, top-p (nucleus), beam search, and the modern penalty pipeline. The dial between robotic and unhinged.
Fine-tune a 70B model by training under 1% of its parameters. The low-rank insight (the weight update is approximately low-rank), LoRA’s frozen-W + B·A adapter, parameter counting, the B=0 zero-start, merge-vs-swap deployment, QLoRA’s 4-bit base, and the wider PEFT family.
MSE, cross-entropy, KL divergence, Huber, contrastive, triplet, InfoNCE — every loss derived from scratch with interactive comparisons.
SGD, momentum, AdaGrad, RMSProp, Adam, AdamW, Lion, Sophia — every update rule derived, hand-computed, and raced head-to-head.
BatchNorm, LayerNorm, RMSNorm, GroupNorm, InstanceNorm, Pre-LN vs Post-LN — every variant derived, hand-computed, and raced in the Arena.
MHA, MQA, GQA, sliding window, linear attention, FlashAttention — every variant derived with memory arithmetic and raced in the Arena.
Sinusoidal, learned, RoPE, ALiBi, NTK scaling — why transformers need position and how rotation won, with length extrapolation arena.
Sigmoid, ReLU, GELU, SiLU/Swish, SwiGLU, Mish — every nonlinearity derived with dead neuron analysis and racing arena.
Warmup, step decay, cosine annealing, 1cycle, WSD — every schedule derived with loss landscape simulations and racing arena.
Xavier, He/Kaiming, orthogonal, transformer recipes — every method derived from variance preservation with deep network explorer.
Vanishing/exploding gradients, clipping, accumulation, mixed precision, loss scaling, checkpointing — the complete stability toolkit.
ResNet residuals, Pre-LN vs Post-LN, DenseNet, Highway — why every modern architecture uses shortcuts and how to choose.
Max, average, GAP, CLS, mean, attention, GeM pooling — how to collapse features into fixed-size vectors for any task.
Token, position, segment, patch embeddings — tied weights, scaling, subword effects, and the lookup tables behind every model.
Epochs, batches, DataLoaders, shuffling, the 6-step training loop, eval mode, common bugs — the complete anatomy of training.
Flips, crops, color jitter, RandAugment, Mixup, CutMix, text augmentation, TTA — synthetic variations that prevent overfitting.
Difficulty scoring, Bengio’s classical curriculum, pacing functions, self-paced learning, teacher-student bandits, DoReMi data mixing — and when ordering doesn’t help.
Self-supervised vision without labels: InfoNCE, temperature, the projection head you throw away, MoCo’s queue, and how BYOL & DINO dodge collapse with no negatives.
Teaching a small model to think like a giant: dark knowledge, temperature softening, the KD loss, feature & attention matching, self-distillation, DistilBERT, and the capacity-gap trap.
Co-adaptation, the mask & inverted-dropout scaling, the ensemble view, spatial dropout, DropConnect, stochastic depth/DropPath, DropBlock — breaking your network on purpose to make it generalize.
LLM-guided evolutionary search over programs. Cap sets, bin packing, scoring functions, best-shot sampling, islands model.
Hands-on: implement the FunSearch loop end-to-end. Prompt design, sandbox execution, evolutionary selection, scaling experiments.
SP+ML overview, applications, course roadmap, full pipeline demo.
x[n], δ[n], periodicity theorem, Nyquist sampling, aliasing.
Bennett’s theorem, SQNR, 6 dB/bit rule, proof sketch.
Non-uniform, centroid/boundary conditions, 1/3 power law, NF4 for LLMs.
Linearization, subtractive dither, NVFP4, gradient accumulation.
Fourier basis, orthogonality, DFT/IDFT, FFT butterfly.
Centroid, spread, kurtosis, entropy, flatness, flux.
Windowed DFT, spectrogram, uncertainty principle, overlap-add.
NN classifier, Hilbert spaces, Parseval, template matching.
CWT, DWT filter bank, Haar, Daubechies, multi-resolution.
Denoising, JPEG2000, wavelet families, Fourier vs wavelet.
LTI, convolution theorem, eigenvectors, circulant matrices.
Log trick, mel scale, mel filter bank, MFCC pipeline.
Bayes risk, likelihood ratio, ROC curves, Neyman-Pearson.
Stationarity, autocorrelation, Gaussian classification, LDA/QDA.
J(w) criterion, scatter matrices, simultaneous diagonalization.
Maximum margin, hard/soft margin, slack variables, multi-class.
Lagrangian, KKT, primal→dual derivation, dual SVM.
Feature maps, kernel trick, RBF, Mercer, representer theorem.
Least squares, ridge, LASSO, autoregressive models, Yule-Walker.
Bayesian regression, kernel regression, Gaussian processes.
Wiener filter, LMS algorithm, noise cancellation, echo cancellation.
Hidden layers, activations, backprop, universal approximation.
Convolutional layers, spectrograms+CNN, BatchNorm, ResNets.
QKV, scaled dot-product, multi-head, transformer blocks.
Signal tokenization, causal masking, GPT-style prediction.
Forward/reverse process, noise prediction, WaveGrad.
Pilanci’s reformulation, group LASSO, double descent.
VAE, nuclear norm, Robust PCA, matrix separation.
Multiplicative updates, source separation, deep MF.
K-SVD, OMP, matching pursuit, LASSO, sparse coding.
MCU vs MPU, registers & memory map, GPIO, timers, interrupts, communication protocols, software patterns.
Bitwise ops, pointers, dynamic memory, recursion, data structures, sorting, state machines, storage maps.
SPI/I²C/UART, MCU selection, sensors, wireless (BLE/WiFi/LoRa), PCB layout, firmware testing.
Software stack, clock trees, DMA, peripheral drivers, power management, optimization, IoT case study.
STM32L475 registers, ARM assembly, timer deep dive, NVIC interrupts, ISR patterns, low-power tickless.
Rate Monotonic Analysis, scheduling algorithms, FreeRTOS, semaphores/mutexes, priority inversion.
IoT stacks, MQTT, Azure IoT Hub, sensor edge computing, BLE/LoRa, security, 5G applications.
GPIO circuits, ADC/Nyquist, UART/SPI/I²C waveforms, USB enumeration, BLE GATT, RF options.
Cross-compilation, daemons/systemd, kernel modules, device drivers, device tree, kernel building.
INT8/INT4 quantization, pruning, GPTQ/AWQ, profiling, post-training vs QAT, compression pipelines.
Depthwise separable convs, MobileNet, EfficientNet, NAS, hardware co-design, mobile deployment.
KV cache, Flash/MQA/GQA attention, continuous batching, speculative decoding, TP/PP parallelism.
The complete inference stack: tokenization, prefill/decode, GPU hardware, metrics, batching, FlashAttention, speculative decoding, quantization, parallelism, production serving.
Stanford CS 229s Fall 2023. Hardware-aware algorithm design, transformer efficiency, FlashAttention, sparsity & quantization, finetuning, parallelism, efficient architectures, cluster scheduling.
CPU vs GPU, memory hierarchy, arithmetic intensity, roofline model, tiling, operator fusion.
Training FLOPs, backward pass costs, memory analysis, prefill vs decode, KV cache, speculative decoding.
Attention taxonomy, SRAM vs HBM, tiled computation, online softmax derivation, FlashAttention-2.
Pruning (structured/unstructured), INT8/INT4, PTQ, QAT, SmoothQuant, knowledge distillation.
Zero/few-shot, CoT, instruction tuning, RLHF pipeline, reward modeling, Constitutional AI, LoRA.
SFT loss, REINFORCE derivation, PPO clipped objective, reward modeling, the full RLHF loop.
Data parallelism, AllReduce, tensor parallelism, pipeline parallelism, ZeRO stages, 3D parallelism.
State space models, discretization, HiPPO, S4, Mamba selective SSMs, convolution-recurrence duality.
Job queues, FIFO/SJF, GPU cluster analysis, Gavel heterogeneity-aware scheduling, placement, fairness.
Memory layout, strides, views vs copies, broadcasting, dtypes, GPU transfer — the byte-level foundation.
Computation graphs, chain rule, backward pass, gradient accumulation, custom functions — build micrograd.
Forward → loss → backward → step. Optimizers, dataloaders, schedulers — train a network live in-browser.
Parameter registration, hooks, containers, the __call__ protocol, JIT — what happens inside model(x).
Build a full training pipeline by hand — raw tensors to MNIST classifier — then show torch.nn does the same thing.
Gaussian, Student-t, Laplace, Beta, Gamma, Exponential, Weibull, Chi-Squared, Von Mises + 5 more.
Bernoulli, Categorical, Binomial, Poisson, Geometric, Hypergeometric, Multinomial + 2 more.
MVN, Dirichlet, Wishart, Gaussian Process, Dirichlet Process, Von Mises-Fisher + 4 more.
GMM, Particle, Horseshoe, Spike-and-Slab, Gumbel, Kumaraswamy + 4 more.
Softmax/Gibbs, SO(3)/SE(3), Normalizing Flows, Copulas, Hawkes, Diffusion + 6 more.
Estimation by a cloud of weighted guesses — for multimodal, nonlinear beliefs the Kalman filter can’t represent. Predict, weight, resample; Monte Carlo Localization; the curse of dimensionality.
The modern SLAM back-end: optimize the whole trajectory + map at once as a sparse least-squares problem. Smoothing vs filtering, variables & factors, Gauss-Newton, the loop closure that snaps a drifted map straight, and bundle adjustment.
One bad measurement wrecks least squares. The two great cures: down-weight outliers (M-estimators, Huber, IRLS) and vote them out (RANSAC). Influence functions, breakdown points, and the robust kernels behind every SLAM back-end and SfM pipeline.
The mother of all filters — recursive belief update from first principles.
Track a moving object through noise — the most elegant algorithm in engineering.
When the world is nonlinear — linearize with Jacobians.
Beyond linearization — sigma points capture nonlinearity directly.
Sequences with hidden causes — Forward, Viterbi, Baum-Welch.
Prior × likelihood = posterior. The foundation of all inference.
Variables with dependencies — DAGs, d-separation, message passing.
2R/6R robot arms, geometric subproblems, 3D IK solvers, singularities, workspace — heavy 3D & 2D sims.
The chicken-and-egg problem — EKF-SLAM, particle filters, graph-based.
IMU + camera fusion — preintegration, MSCKF, tightly-coupled.
Deep features, neural implicit maps, Gaussian splatting SLAM.
Learned inertial models, transformer odometry, foundation models.
Every symbol explained, every step derived. From REINFORCE to off-policy IS to KL constraints → PPO.
Learn to estimate what’s good vs bad. MC, bootstrapping, N-step returns, PPO, SAC — the complete guide.
The practical algorithms. Clipping, GAE, replay buffers, Q-functions — from theory to real robots.
Train policies without environment interaction. AWR, AWAC, IQL — data stitching, expectile regression, and implicit policy improvement.
Where do rewards come from? Goal classifiers, adversarial training, human preferences, Bradley-Terry, RLHF, and Constitutional AI.
No policy needed. Bellman optimality, target networks, double Q-learning, N-step returns — value-based RL.
Behavioral cloning, expressive policies, diffusion policies, compounding errors, DAgger — learning from demos.
Every derivation, every proof. MDPs → Policy Gradients → TRPO → PPO → SAC → DQN → Offline RL → DPO.
Learn a simulator of the world, then practice inside it. Dyna, MBPO, MPC, CEM, value-augmented planning, AlphaGo as MBRL.
One policy, many tasks. Task conditioning, goal-conditioned rewards, HER — turning failures into free training data.
Complete midterm prep: every algorithm, every equation, 36+ practice questions, interactive exam simulator, cheat sheet.
Noam Brown’s approach: search, self-play, and RL for superhuman reasoning in language models.
The post-training frontier: reward models, PPO alignment, direct preference optimization, and beyond.
Learning to learn new tasks from a handful of episodes. Black-box architectures, exploration strategies, task inference as POMDP.
Decompose long-horizon tasks into subtask hierarchies. HL/LL policies, goal representations, DAgger for hierarchy, HIRO, skill discovery.
Four eras of NLP: rule-based translation, hand-built AI, statistical methods, neural revolution. Understanding vs pattern matching.
One-hot limitations, distributional hypothesis, Word2Vec (CBOW & Skip-gram), GloVe, negative sampling, word analogies, embedding bias.
Neurons, activations, forward pass, loss functions, chain rule, backprop algorithm, gradient descent playground, practical tips.
N-grams, neural LMs, recurrent networks, BPTT, vanishing gradients, LSTM & GRU, live text generation, attention preview.
Self-attention, scaled dot-product, multi-head attention, positional encoding, encoder-decoder, build a Transformer piece by piece.
Debugging, hyperparameter tuning, regularization, training diagnostics dashboard, evaluation & error analysis.
BERT vs GPT paradigms, masked LM, autoregressive LM, data pipelines, scaling laws, Chinchilla, Llama 3 architecture.
SFT, reward modeling, PPO with KL constraint, DPO, alignment data, evaluation, safety guardrails.
In-context learning, prompt engineering, chain-of-thought, lottery tickets, adapters, LoRA, PEFT playground.
Retrieval-augmented generation, dense retrieval, ReAct reasoning loops, Toolformer, agent simulator.
MMLU, HELM, LLM-as-judge, benchmark explorer dashboard, data contamination, metric saturation.
Chain-of-thought, self-consistency, zero-shot CoT, process rewards, DeepSeek-R1, GRPO & DAPO.
Process reward models, best-of-N, speculative decoding, RoPE, context extension, test-time compute scaling.
BPE, WordPiece, SentencePiece, multilingual fertility, XLM-R cross-lingual transfer, tokenizer explorer.
Linear probes, attention visualization, sparse autoencoders, agentic interpretability, concept discovery.
Bias, toxicity, misinformation, privacy, environmental cost, governance & policy.
Vision encoders, early/late fusion, Chameleon, Transfusion, scaling mixed-modal, retrieval-augmented multimodal.
When LoRA fails, regularization in PEFT, merging adapters, QLoRA, DoRA, future of adaptation.
Reasoning, grounding, efficiency, safety, multimodal frontier, interactive research map.
Why graphs? Node/edge/graph-level tasks. AlphaFold, PinSage, drug interactions, antibiotic discovery.
Encoder-decoder framework, DeepWalk, node2vec (biased walks with p,q), negative sampling, matrix factorization view.
Message passing, computation graphs, GCN (mean aggregation), GNNs generalize CNNs, transformers as GNNs.
Message + aggregation framework. GCN vs GraphSAGE vs GAT. Multi-head attention. Skip connections, over-smoothing.
Feature/structure augmentation, virtual nodes, neighbor sampling, prediction heads, loss functions, dataset splitting.
How powerful are GNNs? WL test, multiset injectivity, why GCN/GraphSAGE fail, GIN achieves maximum expressiveness.
Beyond WL: Laplacian eigenvectors, structural features, position-aware GNNs, anchor sets, higher-order k-WL.
Self-attention on graphs, positional encodings (Laplacian, random walk), Graphormer, GPS hybrid architecture.
Multiple node/edge types. RGCN (relation-specific weights), basis decomposition, HGT, metapaths.
TransE, TransR, DistMult, ComplEx, RotatE. Relation patterns: symmetry, composition, 1-to-N. KG completion.
Bipartite graphs, NGCF, LightGCN, BPR loss, PinSage scalability, cold start, GNN vs LLM.
Databases ARE graphs. Foreign keys as edges, GNNs on relational data, RelBench, vs XGBoost.
Temporal message passing, multi-hop aggregation, Griffin universal encoder, schema-specific vs universal.
Graph foundation models, pre-training, explainability, equivariant GNNs, dynamic graphs, practical tips.
Inductive KG reasoning, ULTRA, cross-schema transfer, foundation models for knowledge graphs.
LLMs as feature encoders, GNNs as structure encoders, joint pipelines, graph-augmented LLMs.
ReAct, Reflexion, graph-grounded agents traversing KGs step by step. Tool use, PPO/DPO optimization, STaRK benchmarks.
GraphRNN two-RNN sequential generation, GCPN RL-guided design, JT-VAE motifs, diffusion on graphs, molecular benchmarks.
Full arc from node embeddings to graph foundation models. Open problems, research frontiers, practical advice for GNN practitioners.
k-NN, L1/L2 distance, hyperparameter tuning, cross-validation.
Score function Wx+b, hinge loss, cross-entropy, regularization.
MSE, cross-entropy, KL divergence, Huber, contrastive, triplet, InfoNCE — every loss derived from scratch with interactive comparisons.
Gradients, SGD, chain rule, computation graphs, gate patterns.
Neurons, activation functions, layers, representational power.
Data preprocessing, weight init, batch norm, dropout.
Loss curves, learning rates, momentum, Adam, hyperparameter search.
Build a 2-layer net from scratch. Live training visualization.
Convolution, filters, pooling, LeNet to ResNet.
Feature extraction vs fine-tuning, freezing layers.
RNNs, LSTM, BPTT, vanishing gradients, char-level LM.
Perception-action loop, model-based planning, imitation learning, diffusion policies, VLAs & foundation models.
CLIP, LLaVA, Flamingo, Molmo, SAM — contrastive learning, VLMs, promptable segmentation, model chaining.
Point clouds, meshes, SDFs, PointNet, NeRF, 3D Gaussian Splatting — from representations to neural rendering.
L1/L2, cross-entropy, SGD, momentum, Adam, learning rate schedules, weight initialization.
Computational graphs, chain rule, gate patterns, Jacobians — backprop from first principles.
Convolution operation, stride, padding, pooling, receptive field, depthwise separable convolutions.
Data augmentation, batch norm, dropout, AlexNet → VGG → ResNet → EfficientNet → ConvNeXt.
Vanilla RNN, BPTT, vanishing gradients, LSTM gates, GRU, char-level LM, image captioning.
Bahdanau attention, self-attention, multi-head, transformer blocks, positional encoding, ViT.
FCN, U-Net, R-CNN family, YOLO, Mask R-CNN, GradCAM, adversarial examples.
Two-stream networks, 3D convolutions, I3D, SlowFast, TSM, video transformers, VideoMAE.
Single-frame to SlowFast to video transformers. Two-stream, 3D conv, skeleton-based, temporal detection.
Brightness constancy, Lucas-Kanade, Horn-Schunck, FlowNet, RAFT. Dense motion estimation from zero.
GPU hardware, data/pipeline/tensor parallelism, FSDP, ring all-reduce, 3D parallelism — training at scale.
Pretext tasks, MAE, InfoNCE, SimCLR, MoCo, CPC, DINO — learning without labels.
Density functions, MLE, autoregressive models, PixelCNN, autoencoders, VAEs — the full ELBO derivation.
GANs, rectified flow, classifier-free guidance, latent diffusion, DiT, text-to-image & video generation.
Research-backed system design lessons. Real architectures, real numbers, real tradeoffs — with interactive Canvas simulations that trace requests, visualize scale, and simulate failures.
Practical distributed systems engineering — networking, consensus, CRDTs, transactions, caching, load balancing, resiliency patterns, observability, and deployment. Interview-grade depth with interactive simulations.
TCP, TLS, flow control, congestion control, QUIC — the networking layer.
DNS, REST APIs, gRPC, idempotency — how services find and talk to each other.
Failure detection, physical/logical/vector clocks, HLC — time in distributed systems.
Raft, state machine replication, consistency models, chain replication.
CRDTs, gossip protocols, Dynamo, CALM theorem, causal consistency.
ACID, isolation levels, 2PC, sagas, outbox pattern.
HTTP caching, reverse proxies, CDN architecture, cache invalidation.
Range/hash partitioning, consistent hashing, blob storage.
DNS, L4, L7 load balancing — distributing traffic across servers.
Replication, NoSQL taxonomy, caching patterns, eviction policies.
Microservices, API gateways, service mesh, control/data planes.
Message queues, pub/sub, Kafka, exactly-once, backpressure.
Cascading failures, redundancy, shuffle sharding, cellular architecture.
Timeouts, retries, circuit breakers, rate limiting, load shedding.
Test pyramid, chaos engineering, CI/CD, canary deploys, rollbacks.
Metrics, SLIs/SLOs, alerting, logs, distributed tracing, dashboards.
Deep interactive lessons from foundational distributed systems papers. Every definition derived, every algorithm implemented, every proof traced.
Chapter-by-chapter deep dive into the classic algorithms textbook. Each chapter is a standalone interactive lesson with interview-grade depth, Canvas simulations, coding drills, and mastery challenges.
Insertion sort, merge sort, algorithm analysis — the foundations of algorithmic thinking.
Big-O, Omega, Theta — the language of algorithm efficiency.
Max subarray, Strassen, Master theorem — breaking problems into pieces.
Binary heaps, heapsort, priority queues — scheduling and selection.
Partition, randomize, conquer — the fastest practical sorting algorithm.
Counting sort, radix sort, bucket sort — breaking the O(n log n) barrier.
Hashing, chaining, open addressing, perfect hashing — O(1) lookup.
Search, insert, delete, traverse — ordered data made fast.
Self-balancing BSTs — guaranteed O(log n) with coloring rules.
Rod cutting, LCS, edit distance, knapsack — the paradigm that dominates interviews.
Activity selection, Huffman, fractional knapsack — locally optimal = globally optimal.
Aggregate, accounting, potential — the true cost of operation sequences.
BFS, DFS, topological sort, SCCs — the foundation of every graph question.
Kruskal, Prim, Union-Find — connecting everything at minimum cost.
Dijkstra, Bellman-Ford, DAG shortest paths — optimal routes through graphs.
Ford-Fulkerson, max-flow min-cut, bipartite matching — optimizing network flow.
Quickselect, median of medians — finding the k-th element in linear time.
Disk-optimized search trees — billions of keys with just 2-3 disk reads.
Amortized O(1) decrease-key — the theoretical champion for graph algorithms.
O(log log u) predecessor queries — when the universe is your friend.
Union-Find: nearly O(1) amortized connected-component queries with union by rank and path compression.
Floyd-Warshall, Johnson's — shortest paths between every pair of vertices.
Fork-join, work/span, parallel merge sort — harnessing multiple cores.
LU decomposition, least squares, Cholesky — the linear algebra engine.
Simplex, duality, LP reductions — the universal optimization framework.
DFT, FFT, convolution — O(n log n) polynomial multiplication.
GCD, modular arithmetic, RSA, primality — the math behind cryptography.
KMP, Rabin-Karp, finite automata — finding patterns in text efficiently.
Convex hull, closest pair, line intersection — algorithms in 2D space.
P vs NP, reductions, SAT, TSP — the limits of efficient computation.
Vertex cover, TSP, set cover — good-enough solutions with guarantees.
Databases, ML, compilers, networking, crypto, graphics — where every CLRS chapter shows up.
Chapter-by-chapter deep dive into Martin Kleppmann's DDIA. Each chapter is a standalone interactive lesson with interview-grade depth, Canvas simulations, design challenges, debug scenarios, and mastery components.
Reliability, scalability, maintainability — the vocabulary of systems design interviews.
SLAs, percentiles, capacity planning, load testing — quantifying system quality.
Relational, document, graph — choosing how to structure and query your data.
B-trees, LSM-trees, column stores — how databases actually store and find your data.
JSON, Protobuf, Avro, schema evolution — how data survives change.
Leader-follower, multi-leader, leaderless — keeping copies consistent across machines.
Hash partitioning, range partitioning, rebalancing — splitting data across machines.
ACID, isolation levels, MVCC, serializability — the guarantees that keep your data correct.
Network faults, clock drift, process pauses — everything that can go wrong, will.
Linearizability, Raft, Paxos — how distributed nodes agree on the truth.
MapReduce, Spark, dataflow engines — processing massive datasets efficiently.
Kafka, Flink, event sourcing, windowing — processing data as it arrives.
Streaming philosophy, bias, privacy, responsibility — the human side of data systems.
54 hands-on exercises covering scaling laws, Chinchilla, compute-optimal training, loss prediction, data mixing, emergent abilities.
52 exercises across 5 modes: derive, trace, build, design, debug. Parameter counts, attention FLOPs, memory budgets, KV caches, throughput estimation.
Chain rule by hand, gradient shapes, cross-entropy, Adam optimizer, batch norm, learning rate schedules, mixed precision, training diagnostics.
Bayes rule, distributions, MLE, MAP estimation, Kalman filter math, HMM forward/Viterbi, information theory, sensor fusion.
Bellman equations, value iteration, Q-learning updates, policy gradients, advantage estimation, PPO clipping, exploration strategies.
Forward process, noise schedules, DDPM loss, score functions, sampling, classifier-free guidance, latent diffusion, flow matching, ODE/SDE.
Tensor/pipeline/data parallelism, continuous batching, PagedAttention, speculative decoding, quantization, cost optimization.
Rotation matrices, homogeneous transforms, EKF predict/update, SLAM graphs, inverse kinematics, Jacobians, PID control.
Implement softmax, linear layers, attention, layer norm, positional encoding, BPE tokenizer, cross-entropy, Adam, KV cache, sampling.
Asymptotic analysis, recurrences, sorting, hash tables, BSTs, dynamic programming, greedy, graph algorithms, shortest paths, MST.
Multi-dimensional KF, EKF Jacobians, range-bearing updates, UKF sigma points, particle filters, sensor fusion, observability, noise tuning.
EKF-SLAM, landmark observation, data association, loop closure, graph SLAM, visual odometry, RANSAC, factor graphs, occupancy grids.
Graph basics, node embeddings, GCN/SAGE/GAT, over-smoothing, spectral theory, knowledge graphs, link prediction, graph generation.
Word vectors, Word2Vec gradients, dependency parsing, perplexity, attention, subword tokenization, pretraining, PEFT, evaluation metrics.
Discrete signals, DFT by hand, quantization, Lloyd-Max, STFT, wavelets, MFCC pipeline, Bayes classifiers, SVM margins.
Ring AllReduce, ZeRO stages, gradient accumulation, LR scaling, pipeline parallelism, tensor parallelism, checkpointing, mixed precision.
Convolution math, receptive fields, pooling, anchors/IoU, NMS, ResNet, segmentation, stereo vision, homography.
Matrix operations, Gaussian elimination, eigenvalues, SVD, PCA, matrix calculus, least squares, norms, positive definiteness.
Floating point, catastrophic cancellation, condition numbers, iterative methods, Newton's method, gradient descent, integration, sparse matrices.
Entropy, joint/conditional entropy, KL divergence, cross-entropy loss, source coding, channel capacity, rate-distortion, VAE connection.
Convexity, gradient descent, SGD variants, constrained optimization, KKT conditions, duality, proximal operators, convergence rates, Newton's method.
Reward modeling, RLHF objective, PPO for RLHF, DPO loss, CQL, IQL, model-based RL, reward shaping, multi-agent RL.
Mode collapse, flow matching arithmetic, REINFORCE credit assignment, actor-critic n-step returns, Q-learning TD targets, offline RL, IQL expectile loss, goal-conditioned RL.
Bayesian networks, Dirichlet inference, value of information, Bellman backups, kernel smoothing, MCTS/UCB1, policy gradients, POMDPs, alpha vectors.
Quantization (PTQ, QAT, STE), pruning (magnitude, structured, lottery ticket), knowledge distillation, LoRA, mixed precision, compression pipelines.
Deep-dive lessons built around real engineering roles. Classical foundations, modern tools, real papers, interactive sims. Built for interview prep with real depth.
Customer discovery, rapid prototyping, production deployment at customer sites, SDK integration, debugging in customer environments, demo engineering, incident response.
RAG architecture, agent design, fine-tuning, prompt engineering, evaluation, streaming, production AI infrastructure, safety & guardrails.
Information retrieval, dense embeddings, learning to rank, recommendation systems, serving a web index, the convergence of search + recs + transformers.
API design, request lifecycle, database optimization, caching, rate limiting, auth, observability, scaling patterns, developer experience.
GPU clusters, LLM serving, distributed systems, cost optimization, reliability engineering, CI/CD, monitoring — keeping AI systems running at scale.
Docs, SDKs, community, content strategy, conference talks, launches, feedback loops, measuring impact — with AI company DevRel woven throughout.
Experiment-driven sprints, eval-based acceptance criteria, data pipeline management, research-to-production handoffs, agentic AI project management.
DVA model testing, sim-to-real validation, safety compliance, CI/CD for robotics, 20 interview questions + cheat sheet.
Camera models, epipolar geometry, SfM, bundle adjustment, SLAM, dense reconstruction, point tracking, DUSt3R/VGGT — the full stack for robotics perception.
GPU architecture, profiling, mixed precision, distributed training, model compression, TensorRT, serving at scale, real-time autonomous driving stacks.
Physics simulation, contact models, MuJoCo/Isaac Sim, kinematics, motion planning, control, sim-to-real, RL, ROS2 — the full robotics stack.
Quantization, CUDA kernels, TensorRT, FlashAttention, KV-cache, distributed training, BEV perception, edge deployment, VLA — the full AV inference stack.
Test design, automation, reliability engineering, incident response, safety-critical testing, debugging, observability — onsite interview prep for robotics QE.
Agent SDK, RAG, evaluation, data pipelines, inference serving, monitoring, security, distributed systems — the full agentic AI platform stack.
Pre-training, RLHF, evals, safety, scaling laws, infra, on-call, paper reading, experiment tracking — what a frontier lab researcher actually does.
The practical toolkit for building with AI models. Embeddings, retrieval, RAG, agents, evals, safety — everything between "I have a model" and "I have a production app."
What embeddings are, how they encode meaning, distance in high-dimensional space, embedding models, multimodal embeddings.
Cosine, dot product, Euclidean, Jaccard, learned metrics — measuring "how alike" in vector space.
FAISS, HNSW, IVF, product quantization — storing and searching millions of vectors in milliseconds.
Fixed, sentence, recursive, semantic, structure-aware — how to split documents for embedding and retrieval.
The full pipeline: chunk → embed → store → retrieve → rerank → generate. Eval, failure modes, advanced patterns.
Images, tables, PDFs in RAG. Vision embeddings, ColPali, document parsing, hybrid retrieval.
USB-C for AI: tools, resources, prompts. Build MCP servers and clients. The standard for AI integrations.
System prompts, few-shot, chain-of-thought, structured output, testing — systematic, not vibes-based.
LLM-as-judge, human eval, automated metrics, regression detection, A/B testing for AI systems.
Content filtering, jailbreak prevention, PII detection, output validation, red teaming, defense in depth.
When to fine-tune vs prompt, data prep, LoRA/QLoRA, eval pipelines, deployment, cost analysis.
Function calling, ReAct, multi-step agents, state management, error recovery, guardrails for agents.
Record yourself teaching on a whiteboard with optional camera. Export as MP4. The ultimate Feynman test: if you can teach it, you understand it.