Vea nors

10 chapters · Physical Intelligence

π0: VLA Flow Model for General Robot Control

The foundation model — flow matching for continuous robot actions across 7 robot types.

11 chapters · Physical Intelligence

π0.5: Open-World Generalization

First VLA to clean kitchens in new homes via heterogeneous co-training.

Knowledge Insulating VLAs

Train fast, run fast, generalize better — decoupling VLM knowledge from action learning.

π*0.6: Learning From Experience

The first VLA that improves from its own real-world experience via RL.

8 chapters · Physical Intelligence

Helix: Training-Time Action Conditioning

Efficient real-time chunking via training-time action conditioning.

8 chapters · Physical Intelligence

Real-Time Chunking for Flow Policies

Making action chunking flow policies run in real-time.

FAST: Efficient Action Tokenization

5-10x fewer action tokens via DCT-based compression.

MEM: Multi-Scale Embodied Memory

Giving VLAs memory across multiple time scales for long-horizon tasks.

Human to Robot Transfer

How internet video of humans performing tasks transfers to robot control.

8 chapters · Physical Intelligence

RL Token: Bootstrapping RL with VLAs

Fine-tuning VLAs with RL by treating reward as a special token.

10 chapters · Physical Intelligence

π0.7: Steerable Generalist Foundation Model

Diversified prompts unlock emergent dexterity, cross-embodiment transfer, and compositional generalization.

2604.19728 · 2026

VLA Foundry: Unified LLM→VLM→VLA Training

TRI’s open-source framework unifying the full pipeline. Qwen3-VL backbone beats LBM by 23 pp on bimanual manipulation.

10 chapters · Mercat, Keh et al. (TRI)

2510.13054 · 2025

VLA-0: Zero-Modification VLAs

Predict robot actions as plain text — no new tokens, no action heads, no architectural changes. Surprisingly SOTA on LIBERO.

10 chapters · Goyal et al. (NVIDIA)

2602.04215 · 2026

OAT: Ordered Action Tokenization

Three desiderata for action tokens: compression, decodability, ordering. Nested dropout + FSQ = anytime coarse-to-fine robot actions.

10 chapters · Liu, Han et al. (Harvard + Stanford)

Music & Audio Generation AI (18)

ICLR 2020

DDSP: Differentiable Digital Signal Processing

Put oscillators and filters inside the network and backprop through them — high-fidelity audio from 13 min of data, with timbre transfer and dereverberation for free.

10 chapters · Engel et al., Google Brain

11 ch · Lyria Team, Google DeepMind

Live Music Models: Magenta RealTime

An open-weights model that streams unbroken audio in real time and bends to text and audio prompts mid-performance.

10 ch · Kim, Brade, Donahue, Huang et al. (CMU/MIT/UCSD/KAIST)

A Design Space for Live Music Agents

184 systems, 40 years, 3 fields — mapped into 31 dimensions for systems that listen and respond to musicians live.

11 ch · van den Oord et al. (DeepMind)

WaveNet: A Generative Model for Raw Audio

Generate raw sound one sample at a time — 16,000 predictions/sec — with dilated causal convolutions, no recurrence.

10 ch · van Merrienboer et al. (Google DeepMind)

Perch 2.0: The Bittern Lesson for Bioacoustics

A tiny supervised model trained on 14,795 species beats every self-supervised model at animal sound — and transfers to whales.

11 ch · Engel, Resnick, Roberts et al.

Neural Audio Synthesis with WaveNet Autoencoders

Generate instrument sounds one sample at a time — and discover a timbre space where you can morph a flute into an organ.

11 ch · Novack, Brade et al. (UCSD/MIT/Adobe)

Live Music Diffusion Models

Turn a slow offline audio diffusion model into a real-time instrument you can jam with — by making KV-caching work for diffusion.

11 ch · Karchkhadze & Dubnov (UCSD/IRCAM)

Real-Time Musical Co-Performance with a Look-Ahead Diffusion Model

A latent diffusion model jams with a live human, generating ahead of the beat to hide latency — distilled 5.4× faster.

11 ch · Novack et al. (UCSD/Sony)

FlashFoley: Fast Interactive Sketch2Audio Generation

Hum a sound and hear realistic Foley in 75 ms — fine-grained control plus real-time streaming in one open-source model.

10 ch · Wu, Kim, Huang (MIT)

MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation

Bolt 55,000 music tokens onto a 1B text LLM, train in two stages, and it writes editable multitrack MIDI 50× faster.

12 ch · Huang, Yang, Chen et al. (Smule/UCSD/Rochester)

StylePitcher: Style-Following Pitch Curves for Singing

One model learns a singer’s vibrato and ornaments from a few seconds, then fills missing pitch — no task-specific retraining.

10 ch · Dixit, Heller & Donahue (CMU)

Vision Language Models Are Few-Shot Audio Spectrogram Classifiers

Turn a sound into a picture of its frequencies, hand it to GPT-4o with a few examples, and it recognizes the sound — beating experts.

11 ch · Donahue et al. (Google Research)

SingSong: Generating Musical Accompaniments from Singing

Sing into a mic and get back a full instrumental band locked to your voice — trained on pairs made by a source separator.

10 ch · Kim, Saito, Donahue et al. (Sony AI/CMU)

Music Arena: Live Evaluation for Text-to-Music

Let users blind-battle two music generators on real prompts — turning clicks into a Bradley-Terry leaderboard and preference data.

11 ch · Wang, Lindlbauer & Donahue (CMU)

Music-Aware Virtual Assistants: Notifications that Sing

Your assistant sings notifications in your song’s key, on its beat, in place of the chorus — instead of barking over the music.

11 ch · Thickstun, Hall, Donahue & Liang (Stanford/CMU/DeepMind)

The Anticipatory Music Transformer

Give a model a melody and it composes accompaniment — by interleaving controls early so a plain GPT can plan ahead.

11 ch · Wu, Donahue, Watanabe & Bryan (CMU/Adobe)

Music ControlNet: Multiple Time-varying Controls for Music Generation

Borrow ControlNet from image generation for sound — frame-level control over melody, dynamics, and rhythm at once.

11 ch · Agostinelli, Denk, Borsos et al. (Google Research)

MusicLM: Generating Music From Text

Type a prompt, get minutes of coherent 24 kHz music — no paired data, via three frozen tokenizers and three Transformers.

Model Compression & Quantization (22)

11 ch · Guo, Qiu, Leng et al. (SJTU/MSR/Alibaba)

SQuant: On-the-Fly Data-Free Quantization

Quantize a network to 4-bit in under a second — no data, no backprop — by minimizing the signed sum of rounding errors.

11 ch · Yvinec, Dapogny, Cord & Bailly (Sorbonne/Datakalab)

PowerQuant: Automorphism Search for Non-Uniform Quantization

Compress a network to 4 bits without training data — replacing the even grid with a learned power map that keeps integer math intact.

10 ch · Yvinec, Dapogny & Bailly (Datakalab/Sorbonne)

NUPES: Non-Uniform Quantization via Power Exponent Search

Bend the quantization grid with one learnable exponent, then learn new quantized weights by gradient descent. SOTA W4/A4.

11 ch · Haroush, Hubara, Hoffer & Soudry (Habana/Technion)

The Knowledge Within: Data-Free Model Compression

Quantize and fine-tune to 4-bit without touching training data — by mining synthetic images from baked-in BatchNorm stats.

11 ch · Wortsman, Farhadi & Rastegari (AI2/UW/XNOR)

Discovering Neural Wirings

Treat the whole network as one graph, fix a budget of k edges, and let training decide the wiring — one run, no retraining.

10 ch · Yvinec, Dapogny & Bailly (Sorbonne/Datakalab)

Shrinking a Model with a Look-Up Table: JLCM (Jointly Learnable Codebooks & Mappings)

Replace every weight with a 2-bit index into a tiny table that is learned end-to-end. Llama 7B drops from 14 GB to 2 GB.

10 ch · Wang, Liu, Lin, Lin & Han (MIT)

HAQ: Hardware-Aware Automated Quantization with Mixed Precision

Let an RL agent assign each layer its own bitwidth — guided by direct latency and energy feedback from the actual chip.

10 ch · Zhe, Lin, Chandrasekhar & Girod (Stanford/A*STAR)

Optimal Bit Allocation for Compressing Weights & Activations

Treat quantization as rate–distortion: one Lagrangian condition tells you exactly how many bits each layer should get.

2021

HAWQ-V3: Dyadic Neural Network Quantization

Run a whole network with only integer multiply, add, and bit-shift — an ILP decides which layers get 4 vs 8 bits.

10 ch · Yao, Dong, Gholami et al. (Berkeley/Amazon)

10 ch · Cai, Yao, Dong et al. (Berkeley/Peking U.)

ZeroQ: Mixed-Precision Quantization With No Data At All

Compress to 4-bit with no data — reconstruct synthetic data from BatchNorm stats, then find per-layer bits in under 30 sec.

10 ch · Chen, Meng, Zhang et al. (CASIA/Horizon Robotics)

JASQ: Joint Architecture Search & Quantization

Fold network design and per-layer bitwidth choice into one evolutionary search on a single GPU.

11 ch · Suau, Zappella & Apostoloff (Apple)

Principal Filter Analysis: Compress a Network by Its Echoes

If two filters always fire together, one is redundant. PFA measures this from responses and tells you how many filters to keep.

11 ch · Cho, Adya, Naik (Apple)

PDP: Parameter-free Differentiable Pruning

Make pruning a smooth, trainable decision with no extra parameters — a softmax against one threshold lets killed weights return.

10 ch · Krishnamoorthi (Google)

Quantizing Deep ConvNets for Efficient Inference

Shrink a 32-bit network into 8-bit integers — 4× smaller, 2–3× faster — the whitepaper that defined the PTQ/QAT vocabulary.

10 ch · Cho, Alizadeh-Vahid, Adya, Rastegari (Apple)

DKM: Differentiable k-Means Clustering for Compression

Squeeze ResNet50 to 3.3 MB by treating hard k-means as soft attention, so the codebook trains end-to-end against the task loss.

9 ch · Cheng, Wang, Zhou & Zhang (IEEE SPM)

A Survey of Model Compression & Acceleration

Four families that shrink a network 5–50× without losing accuracy: pruning & quantization, low-rank, compact filters, distillation.

11 ch · O’Neill (University of Liverpool)

A Survey of Neural Network Compression

Five ways to shrink an overparameterized network — weight sharing, pruning, low-rank, distillation, quantization — in one map.

11 ch · Zhu, Li, Liu, Ma & Wang (CAS)

Shrinking the Giant: Model Compression for LLMs

Four families — quantization, pruning, distillation, low-rank — that crush a billion-parameter LLM to deployable size.

10 ch · Kusupati, Ramanujan, Somani et al. (UW/AI2/MSR)

Learning Sparsity Itself: Soft Threshold Reparameterization

Make the pruning threshold a parameter and push it down the gradient — the network discovers its own per-layer sparsity budget.

10 ch · Colbert, Pappalardo, Petri-Koenig, Umuroglu (AMD)

A2Q+: Accumulator-Aware Weight Quantization

Everyone shrinks weights; almost no one shrinks the accumulator. A2Q+ guarantees no overflow at 12 or 8 bits, +17% accuracy.

10 ch · Zhu (Stanford) & Gupta (Google)

To Prune, or Not to Prune: Gradual Magnitude Pruning

Train big and delete most weights, or train small dense? A four-symbol cubic schedule answers it — big-but-sparse wins.

12 ch · Yuan, Chen, Hu & Peng (IEEE IJCNN)

EvoQ: Mixed-Precision Quantization via Sensitivity-Guided Evolutionary Search

Measure which layers survive 2-bit, then run a sensitivity-seeded genetic search for the best per-layer bitwidth — no retraining.

Vision, 3D & Embodied: New (11)

2606.10804 · 2026

SCAIL-2: Animating Any Character End-to-End

Delete the lossy skeleton — feed the whole driving video. In-context mask conditioning + reverse-driving synthetic data + hand-focused DPO unify every character-animation task.

10 chapters · Yan, Guo, Yang & Tang (Tsinghua, Z.ai)

NVIDIA 2026

MotionBricks: Real-Time Motion from a Modular Latent Brain

One frozen backbone models 350k motion clips at 15,000 FPS — a multi-head tokenizer + smart primitives you assemble like LEGO, zero-shot on game characters and a real Unitree G1.

11 chapters · Wang, Dionne, Peng, Zhu et al. (NVIDIA)

DeepMind 2025

D4RT: Dynamic 4D Reconstruction & Tracking

Encode a video once, then query any point's 3D position at any time from any camera — one interface for depth, tracks, point clouds & pose, at 200+ FPS.

11 chapters · Zhang, Sajjadi et al. (Google DeepMind)

11 ch · Deitke, Clark, Lee et al. (AI2/UW)

Molmo & PixMo: Open VLMs Without Distillation

A SOTA vision-language model built without GPT-4 labels — via data-collection tricks and a dataset that lets it point at pixels.

10 ch · Clark, Zhang, Ma et al. (AI2/UW)

Molmo2: Open Video VLMs That Point and Track

A fully-open video VLM that tells you when and where things happen — emitting spatio-temporal points, no distillation.

12 ch · Yilmaz et al. (RWTH Aachen)

Volume Transformer (Volt): a Vanilla Transformer for 3D Scenes

Drop the specialized 3D backbones — a plain ViT-style encoder with global attention and 3D RoPE becomes SOTA, and scales past the specialists.

11 ch · SAM 3D Team (Meta Superintelligence Labs)

SAM 3D: 3Dfy Anything in a Single Image

Recover full 3D shape, texture, and layout for any object in a cluttered photo — via a human-and-model-in-the-loop flywheel.

11 ch · Wang, Liu, Yu et al. (NVIDIA)

LocateAnything: Parallel Box Decoding

VLMs spell out boxes one digit at a time. LocateAnything emits a whole box in one parallel step — up to 2.5× throughput.

10 ch · Diao, Wang, Wu, Liu et al. (NTU/SenseTime)

NEO-ov: Pixels to Words Without an Encoder

A decoder-only transformer that learns to see — no CLIP, no projector — sending raw patches and words into the same stream.

11 ch · Huang, Chao, Mousavian et al. (Stanford/NVIDIA)

PointWorld: Scaling 3D World Models

A robot glances once, imagines its hand reaching in, and predicts how every 3D point moves — in 0.1 sec, no demos.

11 ch · Shen, Kumar et al. (MIT CSAIL/UPenn)

TiPToP: A Planner That Just Works on Pixels

Bolt pretrained vision foundation models onto a GPU task-and-motion planner — matching a VLA, with zero robot data.

Qwen-VLA: One Brain for Every Robot Body

One vision-language-action model drives manipulators, humanoids, and navigating robots — via a shared next-chunk-of-motion task.

11 ch · Qwen Team (Alibaba)

Thinking Machines Lab: Foundation Techniques (2)

2210.02747 · 2022

Flow Matching for Generative Modeling

Simulation-free training of CNFs via vector field regression. OT paths for faster sampling.

9 chapters · Lipman et al.

2407.15835 · 2024

dMel: Speech Tokenization Made Simple

Training-free discrete speech representation from discretized mel spectrograms.

9 chapters · Bai et al.

Ilya 30u30: Foundational Reading List (16)

2004

Minimum Description Length

Learning as data compression. Kolmogorov complexity, crude vs refined MDL.

10 ch · Grünwald

Order Matters: Seq2Seq for Sets

Read-Process-Write for set inputs, ordering effects on chain rule.

10 ch · Vinyals et al.

GPipe: Pipeline Parallelism

Micro-batch splitting, bubble time analysis, scaling to 557M params.

10 ch · Huang et al.

Deep Residual Learning (ResNet)

Skip connections, degradation problem, 152-layer networks.

10 ch · He et al.

Dilated Convolutions

Exponential receptive fields without pooling for dense prediction.

10 ch · Yu & Koltun

Neural Message Passing

MPNN unifying framework for graph neural networks.

10 ch · Gilmer et al.

Attention Is All You Need

THE Transformer paper — self-attention, multi-head, positional encoding.

11 ch · Vaswani et al.

2015

Bahdanau Attention

Original attention mechanism for neural machine translation.

10 ch · Bahdanau et al.

Identity Mappings (ResNet v2)

Pre-activation BN-ReLU, 1001-layer networks.

10 ch · He et al.

Relation Networks

Pairwise relational reasoning for visual QA.

10 ch · Santoro et al.

Variational Lossy Autoencoder

VAE posterior collapse, bits-back coding, information flow control.

10 ch · Chen et al.

Relational Recurrent Neural Networks

Attention over memory slots, memory-memory interactions.

10 ch · Santoro et al.

2014

Coffee Automaton Complexity

Rise and fall of complexity in closed systems.

10 ch · Aaronson et al.

2014

Neural Turing Machines

External memory, differentiable addressing, copy/sort tasks.

10 ch · Graves et al.

Deep Speech 2

End-to-end speech recognition, CTC loss, English + Mandarin.

10 ch · Amodei et al.

10 chapters · Chi, Xu et al.

Scaling Laws for Neural LMs

Power law scaling over 7 orders of magnitude.

10 ch · Kaplan et al.

Deep Reinforcement Learning (23)

2303.04137 · 2023

Diffusion Policy

Visuomotor policy learning via action diffusion — 46.9% average improvement across 15 tasks.

2304.13705 · 2023

ACT: Action Chunking with Transformers

Low-cost bimanual manipulation via CVAE-based action chunking — 80-90% on fine tasks from 10 min of demos.

10 chapters · Zhao et al.

1992

REINFORCE

The foundational policy gradient — statistical gradient-following via the log-derivative trick.

10 chapters · Williams

NeurIPS 1999

Policy Gradient Theorem

The theorem that enabled actor-critic methods — policy gradients with function approximation.

10 chapters · Sutton et al.

1707.06347 · 2017

Proximal Policy Optimization (PPO)

The workhorse of modern RL — clipped surrogate objective for stable, efficient policy learning.

10 chapters · Schulman et al.

1506.02438 · ICLR 2016

Generalized Advantage Estimation (GAE)

The exponentially-weighted advantage estimator that controls bias-variance tradeoff — used by PPO and every modern actor-critic.

10 chapters · Schulman et al.

1312.5602 · 2013

DQN: Playing Atari with Deep RL

CNN + Q-learning + experience replay — the paper that launched deep RL from raw pixels.

10 chapters · Mnih et al.

1509.06461 · 2015

Double DQN

One-line fix to Q-learning's overestimation bias — decouple selection from evaluation.

10 chapters · van Hasselt et al.

1707.06887 · 2017

Distributional RL (C51)

Learn the full return distribution, not just the mean: Wasserstein contraction and a 51-atom categorical distribution.

10 chapters · Bellemare et al.

2110.06169 · 2021

Implicit Q-Learning (IQL)

Offline RL without querying OOD actions — expectile regression on the value function for implicit policy improvement.

10 chapters · Kostrikov et al.

1706.03741 · 2017

Deep RL from Human Preferences (RLHF)

Learn reward from pairwise human comparisons — the paper that launched RLHF and made ChatGPT possible.

10 chapters · Christiano et al.

2305.18290 · 2023

Direct Preference Optimization (DPO)

Your language model is secretly a reward model — skip the RL loop and align with a simple classification loss.

10 chapters · Rafailov et al.

10 chapters · Gopalan, Chowdhury & Banerjee

Why DPO Is a Misspecified Estimator

DPO projects rewards onto a low-dimensional manifold — causing preference reversal and reward degradation. AuxDPO fixes it with nullspace degrees of freedom.

2502.16182 · 2025

Implicit Preference Optimization (IPO)

Your language model is secretly a preference classifier — extract P("Yes")/P("No") to self-judge and self-improve via DPO.

10 chapters · Garg et al.

1906.08253 · 2019

MBPO: When to Trust Your Model

Short model rollouts branched from real data — 10x sample efficiency of model-free methods with matching asymptotic performance.

10 chapters · Janner et al.

1707.01495 · 2017

Hindsight Experience Replay (HER)

Failed episode? Pretend the achieved state was the goal — sparse rewards solved via implicit curriculum.

10 chapters · Andrychowicz et al.

1611.02779 · 2016

RL²: Fast RL via Slow RL

Learn the RL algorithm itself — encode it in an RNN's weights via meta-learning across tasks.

10 chapters · Duan et al.

2008.02790 · 2021

DREAM: Decoupled Exploration & Exploitation

Break meta-RL's chicken-and-egg problem by separating exploration and exploitation objectives.

10 chapters · Liu et al.

2204.01691 · 2022

SayCan: Grounding Language in Affordances

LLMs know what to do, robots know what they can do — multiply the two for grounded planning.

10 chapters · Ahn et al.

1804.10332 · 2018

Sim-to-Real: Agile Quadruped Locomotion

Domain randomization + accurate actuator models enable RL policies to transfer from sim to real robots.

10 chapters · Tan et al.

2402.10329 · 2024

UMI: Universal Manipulation Interface

In-the-wild robot teaching without in-the-wild robots — handheld gripper demos transfer to any robot via relative actions.

10 chapters · Chi, Xu et al.

2211.07819 · 2022

General Intelligence Requires Rethinking Exploration

The bottleneck isn't better models — it's better data. A unified framework for exploration across SL and RL, and the path to open-ended learning.

10 chapters · Jiang, Rocktäschel, Grefenstette

2004.07219 · 2020

D4RL: Datasets for Offline RL

The benchmark that shaped offline RL. Maze2D, AntMaze, Adroit, Kitchen — diverse datasets revealing algorithm deficiencies.

8 chapters · Fu et al.

Spatial Computing (20)

10 chapters · Apple · Mescheder et al.

SHARP: Single-Image View Synthesis in <1s

Feedforward single-image to 1.2M 3D Gaussians. Depth Pro backbone, two-layer depth, self-supervised finetuning. 1000x faster than diffusion.

ICLR 2026

Theory of Space

Can foundation models build spatial beliefs through active exploration? Cognitive map probing reveals belief instability and inertia.

10 chapters · Zhang, Huang, Wang et al.

1803.11288 · 2018

FutureMapping: Spatial AI Systems

The computational structure of Spatial AI — co-designing SLAM, deep learning, and specialized processors for always-on spatial intelligence.

10 chapters · Davison

1910.14139 · 2019

FutureMapping 2: Gaussian Belief Propagation

Local message passing on factor graphs for distributed Spatial AI — matching algorithms to graph processors.

10 chapters · Davison & Ortiz

2401.12168 · 2024

SpatialVLM: Spatial Reasoning for VLMs

Endowing vision-language models with quantitative spatial reasoning via synthetic 3D training data.

10 chapters · Chen, Xu et al.

2604.14144 · 2026

SpatialEvo: Self-Evolving Spatial Intelligence

Deterministic geometric environments as zero-noise oracles for self-evolving 3D spatial reasoning.

10 chapters · Lazarow, Kang & Dehghan

2604.05212 · 2026

Boxer: 2D→3D Bounding Box Lifting

Robustly lifting open-world 2D detections into metric 3D bounding boxes with uncertainty.

10 chapters · DeTone et al.

2505.23756 · 2025

Rooms from Motion

Object-centric localization and mapping — replacing points with 3D boxes for un-posed indoor SfM.

2412.04458 · 2024

Cubify Anything (CuTR + CA-1M)

Scaling indoor 3D detection: 440K exhaustive 3D boxes + a fully transformer detector that beats point methods.

10 chapters · Lazarow et al.

2306.15667 · 2023

PoseDiffusion

Camera pose estimation as diffusion denoising with geometry-guided epipolar sampling.

2312.04563 · 2023

VGGSfM

Fully differentiable Structure from Motion — deep tracking, joint cameras, differentiable BA.

10 chapters · Hu, Yin et al.

2307.10984 · 2023

Metric3D

Zero-shot metric depth from a single image via canonical camera space transformation.

10 chapters · Yin et al.

2404.15506 · 2024

Metric3D v2

Geometric foundation model for joint zero-shot metric depth and surface normal estimation.

2403.12013 · 2024

GeoWizard

Unleashing Stable Diffusion priors for joint depth and normal estimation from single images.

10 chapters · Fu, Yin et al.

2307.07635 · 2023

CoTracker

Joint point tracking in video — exploiting inter-track correlations with transformer attention.

10 chapters · Karaev et al.

2410.11831 · 2024

CoTracker3

Pseudo-labeling real videos with cycle consistency for better point tracking.

10 chapters · Karaev et al.

2604.14141 · 2026

LingBot-Map

Streaming 3D reconstruction with geometric context attention — anchor, window, and trajectory memory.

10 chapters · Chen et al.

2601.18692 · 2026

LingBot-VLA

Pragmatic VLA foundation model — 20K hours of real dual-arm data, scaling study, GM-100 benchmark.

10 chapters · Wu et al.

LingBot-Depth

Masked depth modeling — treating RGB-D sensor failures as natural masks for self-supervised depth completion.

10 chapters · Tan et al.

2512.14692 · 2025

TRELLIS.2: Native 3D Latents

O-Voxel sparse representation + 16× compression VAE + 4B flow-matching model for high-quality 3D asset generation.

10 chapters · Xiang et al.

World Models (5)

10 chapters · Li, Zhang et al.

LingBot-VA

Causal world modeling for robot control — autoregressive diffusion with MoT and closed-loop rollout.

2604.18486 · 2026

OneVL: One-Step Latent Reasoning

First latent CoT to beat explicit CoT — dual-modal decoders (language + visual world model) compress reasoning, then prefill matches answer-only speed.

10 chapters · Xiaomi EI Team

2604.06425 · 2026

Neural Computers

The neural network IS the computer — unifying compute, memory, and I/O in learned runtime state. Video models that roll out CLI and GUI frames.

10 chapters · Zhuge, Zhao et al. (Meta AI)

10 chapters · Zhang, Jiang et al. (JHU/PKU/Princeton/MIT/Harvard)

World-In-World: Closed-Loop World Models

First open benchmark evaluating world models by embodied task success, not visual quality. Controllability beats aesthetics.

10 chapters · Nasiriany et al. (UT Austin + NVIDIA)

RoboCasa: Large-Scale Kitchen Simulation

120 kitchen scenes, 2500+ AI-generated objects, 100 LLM-designed tasks, MimicGen 80× data amplification. Clear sim-to-real scaling trend.

Agentic Systems (19)

2605.17172 · 2026

OpenJarvis: Personal AI, On Personal Devices

Decompose AI stacks into 5 typed primitives, optimize jointly via LLM-guided spec search. On-device specs within 3.2 pp of cloud at 800× lower cost.

10 chapters · Saad-Falcon, Narayan et al. (Stanford)

2604.05336 · 2026

TRACE: Capability-Targeted Agentic Training

Contrastive trajectory analysis identifies missing capabilities, synthesizes targeted training environments, trains per-capability LoRA adapters via GRPO. +14.1 on τ²-Bench.

10 chapters · Kang, Suresh et al. (Stanford)

2605.18747 · 2026

Code as Agent Harness

Unified framework for executable, verifiable, stateful agent systems — code as the operational substrate for reasoning, acting, memory, and coordination.

11 chapters · Ning, Tieu, Fu et al. (UIUC + Meta + Stanford)

2604.14228 · 2026

Dive into Claude Code

The design space of AI agent systems — permission, compaction, extensibility, delegation, persistence.

10 chapters · Liu et al.

2603.28052 · 2026

Meta-Harness

End-to-end optimization of model harnesses — a coding agent searches over harness code using full filesystem history.

10 chapters · Lee et al.

2508.20033 · 2026

DeepScholar-Bench

Live benchmark for generative research synthesis — no system exceeds 31% geometric mean across knowledge, retrieval, and verifiability.

10 chapters · Patel, Arabzadeh et al.

2503.16416 · 2025

Agent Evaluation Survey

First comprehensive survey of LLM agent evaluation — planning, tool use, web/SWE agents, generalist benchmarks.

10 chapters · Yehudai et al.

2411.00640 · 2024

Error Bars for Evals

Statistical rigor for LLM evaluation — clustered CIs, paired tests, power analysis.

10 chapters · Miller (Anthropic)

2509.06917 · 2025

Paper2Agent

Converting research papers into MCP-based AI agents — auto-generated tools, test-driven refinement.

10 chapters · Miao et al.

2506.07982 · 2025

τ²-Bench

Dual-control agent evaluation — both agent and user use tools in a shared Dec-POMDP environment.

10 chapters · Barres et al.

2410.02089 · 2025

RLEF

RL from execution feedback — teaching code LLMs to read compiler errors and fix iteratively.

10 chapters · Gehring et al.

2310.04406 · 2024

LATS: Language Agent Tree Search

MCTS for LM agents — reasoning + acting + planning unified. 92.7% HumanEval.

10 chapters · Zhou et al.

2311.05772 · 2024

ADAPT

As-needed recursive task decomposition — try first, decompose on failure. +28.3% ALFWorld.

10 chapters · Prasad et al.

2505.22954 · 2026

Darwin Gödel Machine

Self-improving agents — evolve agent code with empirical validation, quality-diversity archive.

10 chapters · Zhang et al.

2408.06292 · 2024

The AI Scientist

Fully automated scientific discovery — ideate, code, experiment, write paper, review. <$15/paper.

10 chapters · Lu et al. (Sakana)

2504.08066 · 2025

AI Scientist v2

Workshop-level discovery via agentic tree search — first AI-authored paper accepted at ICLR workshop.

10 chapters · Yamada et al. (Sakana)

10 chapters · Novikov et al. (DeepMind)

AlphaEvolve

Evolutionary coding agent — first Strassen improvement in 56 years, Google infrastructure optimization.

10 chapters · Du, Yan et al. (Shanghai AI Lab)

MLEvolve

Self-evolving multi-agent framework for automated ML discovery — graph search, retrospective memory, adaptive coding; SOTA on MLE-Bench in half the budget.

10 chapters · He-Yueya, Singh et al. (Stanford)

GIANTS: Insight Anticipation

Predict a downstream paper’s core insight from its parent papers via RL with similarity rewards. 34% improvement over gemini-3-pro.

ICLR 2026 Outstanding

LLMs Get Lost In Multi-Turn Conversation

39% performance drop across 15 LLMs when tasks are delivered gradually across turns. Sharded simulation decomposes degradation into aptitude loss and reliability collapse.

10 chapters · Laban, Hayashi, Zhou & Neville

Inference-Time Compute (12)

10 chapters · Gladstone et al.

Energy-Based Transformers

Predict by minimizing energy — 35% faster scaling than Transformer++, 29% better System 2 thinking from unsupervised learning alone.

2507.19457 · 2026

GEPA: Reflective Prompt Evolution

Natural language reflection on execution traces outperforms GRPO RL with 35× fewer rollouts.

10 chapters · Agrawal et al.

2504.12516 · 2025

BrowseComp

Benchmark for browsing agents — accuracy scales smoothly with test-time compute.

10 chapters · Wei et al. (OpenAI)

2407.21787 · 2024

Large Language Monkeys

Inference scaling via repeated sampling — coverage follows exponentiated power law across 4 orders of magnitude.

10 chapters · Brown et al.

2409.15254 · 2025

ARCHON

Architecture search over inference-time technique combinations — outperforms o1 by 15.1%.

10 chapters · Saad-Falcon et al.

2408.03314 · 2024

Scaling Test-Time Compute

Optimal test-time compute beats 14× larger model. Search vs revision, difficulty-dependent allocation.

10 chapters · Snell et al.

2502.17578 · 2025

LLM Power Laws

Heavy-tailed p-distributions transform per-problem exponential scaling into aggregate power laws.

10 chapters · Schaeffer et al.

2506.18203 · 2025

Weaver: Weak Verifier Ensembles

Combine weak verifiers via weak supervision — Llama 70B + Weaver = o3-mini accuracy.

10 chapters · Saad-Falcon et al.

2305.20050 · 2023

Let's Verify Step by Step

Process reward models (PRM800K) outperform outcome supervision — 78% MATH with per-step verification.

10 chapters · Lightman et al. (OpenAI)

2312.08935 · 2024

Math-Shepherd

Automatic process rewards via completion sampling — no human labels needed for step-level supervision.

10 chapters · Inoue et al. (Sakana)

2503.04412 · 2025

AB-MCTS: Wider or Deeper?

Adaptive branching tree search — dynamically balance width (diversity) vs depth (refinement).

2501.05366 · 2025

Search-o1

Agentic search during reasoning — detect uncertainty, retrieve, reason-in-documents, inject into CoT.

10 chapters · Bai et al. (Anthropic)

Training & RL for LLMs (10)

2507.20534 · 2025

Kimi K2

1T-param MoE with MuonClip optimizer and agentic RL — SOTA open-source on SWE-Bench and agentic tasks.

10 chapters · Kimi Team

2412.19437 · 2024

DeepSeek-V3

671B MoE — MLA (57× KV compression), aux-loss-free balancing, FP8 training, $5.57M total cost.

10 chapters · DeepSeek-AI

2212.08073 · 2022

Constitutional AI

RLAIF — harmlessness from AI feedback via written principles. Critique, revise, train.

2203.14465 · 2022

STaR: Self-Taught Reasoner

Bootstrapping reasoning — generate rationales, filter correct, rationalize wrong, fine-tune, repeat.

10 chapters · Zelikman et al.

2402.03300 · 2024

DeepSeekMath

GRPO — group-relative advantages without critic. 51.7% MATH from 7B model with 120B math tokens.

10 chapters · Shao et al.

2503.14476 · 2025

DAPO

Open-source LLM RL at scale — decoupled clipping, dynamic sampling. 50 on AIME with Qwen2.5-32B.

10 chapters · ByteDance Seed

2504.04736 · 2025

SWiRL: Step-Wise RL

Multi-step RL for reasoning + tool use — decompose trajectories into sub-trajectories, DPO on each.

10 chapters · Goldie et al.

2603.29791 · 2026

Simula: Reasoning-Driven Synthetic Data

Seedless agentic framework — taxonomy-driven coverage, double-critic quality, calibrated Elo complexity scoring at scale.

10 chaptersDavidson et al.

2604.17654 · 2026

Poly-EPO: Exploratory Reasoning

Set RL with polychromic objectives — train LLMs to explore diverse reasoning strategies via reward×diversity synergy. Up to 20% pass@k gains.

10 chaptersOrney, Hamid et al. (Stanford)

Code Generation (3)

2203.07814 · 2022

AlphaCode

Competition-level code generation — 1M samples, behavioral clustering, top 54.3% Codeforces.

10 chaptersLi et al. (DeepMind)

2501.14723 · 2025

CodeMonkeys

Serial × parallel test-time scaling for SWE-bench — 57.4% at $2,292 with amortized context.

10 chaptersEhrlich et al.

2502.10517 · 2025

KernelBench

Can LLMs write efficient GPU kernels? 250 PyTorch workloads, frontier models <20% at fast1.0.

10 chaptersOuyang et al.

Action Recognition & Video Understanding (5)

2003.12039 · 2020

RAFT: Optical Flow

All-pairs correlation volume + iterative GRU refinement — the dominant optical flow method.

10 chaptersTeed & Deng

2202.07925 · 2022

ActionFormer

Temporal action detection via local self-attention on multiscale feature pyramid — no proposals needed.

10 chaptersZhang et al.

2303.16727 · 2023

VideoMAE V2

Scaling masked video autoencoders to 1B+ params with dual masking and progressive training.

2403.15377 · 2024

InternVideo2

6B video foundation model — three-stage training, SOTA on 60+ benchmarks.

2403.06977 · 2024

VideoMamba

State space models for video — linear complexity alternative to quadratic attention for long videos.

Deep Learning for Computer Vision (32)

1409.1556 · 2014

VGGNet

Depth matters: stacking 3×3 convolutions to 16-19 layers for ImageNet SOTA.

10 chSimonyan & Zisserman

1409.4842 · 2014

GoogLeNet / Inception

Multi-scale parallel convolutions in Inception modules — 22 layers, only 5M params.

10 chSzegedy et al.

1411.4038 · 2014

FCN: Fully Convolutional Networks

From classification to dense per-pixel prediction — the birth of deep semantic segmentation.

10 chLong et al.

1311.2524 · 2013

R-CNN

CNN features for object detection — selective search + AlexNet features + SVM classification.

10 chGirshick et al.

1504.08083 · 2015

Fast R-CNN

RoI pooling + multi-task loss: process the image once, 213× faster than R-CNN at test time.

10 chGirshick

1506.01497 · 2015

Faster R-CNN

Region Proposal Networks with anchors — learned proposals replace selective search.

10 chRen et al.

1506.02640 · 2015

YOLO: You Only Look Once

Single-shot detection as regression — 45 fps real-time object detection.

10 chRedmon et al.

2005.12872 · 2020

DETR

End-to-end detection with transformers — Hungarian matching, no NMS, no anchors.

10 chCarion et al.

2104.14294 · 2021

DINO

Self-supervised ViTs with emergent segmentation via self-distillation.

10 chCaron et al.

2304.07193 · 2023

DINOv2

All-purpose visual features without supervision — rivaling CLIP without text.

10 chOquab et al.

2204.12484 · NeurIPS 2022

ViTPose

Plain ViT + two deconv layers = SOTA pose estimation. No fancy modules, no domain-specific tricks. Simplicity wins.

10 chXu et al.

2511.09554 · ICLR 2026

RF-DETR

Weight-sharing NAS for real-time detection — train once, search thousands of sub-architectures for free. First real-time detector above 60 AP on COCO.

10 chRobinson et al.

2308.04079 · 2023

3D Gaussian Splatting

Real-time radiance field rendering via differentiable Gaussian splatting.

10 chKerbl et al.

2212.09748 · 2023

DiT: Diffusion Transformers

Transformer backbone for diffusion with LLM-style scaling laws.

10 chPeebles & Xie

2311.15127 · 2023

Stable Video Diffusion

Three-stage video diffusion — data curation matters more than architecture.

10 chBlattmann et al.

2310.03744 · 2023

LLaVA-1.5

Simple MLP connector + good data beats complex VLM engineering.

10 chLiu et al.

2312.14238 · CVPR 2024

InternVL

6B vision encoder with progressive LLM alignment for universal multimodal tasks.

10 chChen et al.

2311.16502 · CVPR 2024

MMMU Benchmark

Expert-level multimodal benchmark — GPT-4V gets 56.8%, humans get 88.6%.

10 chYue et al.

2312.14132 · CVPR 2024

DUSt3R

Geometric 3D vision made easy — predict pointmaps directly from image pairs.

10 chWang et al.

2401.10891 · 2024

Depth Anything

Monocular depth foundation model via self-training on 62M unlabeled images.

10 chYang et al.

2402.15391 · ICML 2024

Genie

Generative interactive environments from unlabeled video — ICML Best Paper.

10 chBruce et al.

2403.03206 · 2024

Stable Diffusion 3

Rectified flow + MMDiT for scalable text-to-image synthesis.

10 chEsser et al.

2404.02905 · NeurIPS 2024

VAR: Visual Autoregressive Modeling

Next-scale prediction — autoregressive image generation with LLM-style scaling. Best Paper.

10 chTian et al.

2406.09414 · 2024

Depth Anything V2

Synthetic precision + real diversity for state-of-the-art monocular depth.

10 chYang et al.

2406.09756 · 2024

MASt3R

Joint 3D reconstruction + dense matching — grounding image matching in 3D.

10 chLeroy et al.

2408.00714 · 2024

SAM 2

Segment anything in images AND videos — streaming memory attention.

10 chRavi et al.

2408.06072 · 2024

CogVideoX

Expert transformer for video generation with 3D VAE compression.

10 chYang et al.

2409.12191 · 2024

Qwen2-VL

Native dynamic resolution VLM with 2D-RoPE — any image, any aspect ratio.

10 chWang et al.

2412.12392 · CVPR 2025

MASt3R-SLAM

Real-time dense SLAM powered by learned 3D reconstruction priors.

10 chMurai et al.

2503.11651 · CVPR 2025

VGGT

One transformer, one pass — all 3D geometry from unposed images. CVPR Best Paper.

10 chWang et al.

2601.05246 · ICLR 2026 Oral

Depth Anything 3

One plain transformer recovers the visual space from any views — depth + rays, no architectural specialization.

10 chLin, Chen, Liew et al.

CS231n · 2024

GAS: Semantic Mapping from Video

SLAM + detection + segmentation + projection — monocular video to semantic floorplans.

10 chKuznetsov, Bhutra, Pal

2025 (Reimagined)

GAS v2: Modern Semantic Mapping

Rebuilding the semantic mapping pipeline with 2025 foundation models — zero fine-tuning, open vocabulary, no calibration.

12 chKuznetsov, Bhutra, Pal

2511.16719 · 2025

SAM 3: Segment Anything with Concepts

Unified detect + segment + track for open-vocabulary concepts — text or exemplar prompts find ALL instances. Doubles prior accuracy.

11 chCarion, Gustafson, Hu et al.

2604.20329 · 2026

Vision Banana: Generators are Vision Learners

Instruction-tune an image generator to output decodable RGB — beats SAM 3 on segmentation and Depth Anything 3 on depth with one model.

10 chGabeur, Long, Peng et al. (Google)

2604.21681 · ICLR 2026

Sapiens 2: Human-Centric Vision

0.4B–5B ViTs pretrained on 1B human images with unified MAE + contrastive learning. SOTA on pose, segmentation, normals, pointmaps, albedo.

10 chKhirodkar et al. (Meta RL)

Continual Learning (3)

2302.00487 · 2023

Continual Learning Survey

The full landscape of learning without forgetting — 7 scenarios, 5 method families, unified Bayesian framework.

10 chWang, Zhang, Su, Zhu

NeurIPS 2025

Nested Learning

The illusion of depth — neural nets as nested multi-level optimization with associative memories at every frequency.

10 chBehrouz, Razaviyayn, Zhong, Mirrokni

2504.17963 · IEEE SPM 2025

Mathematics of Continual Learning

Adaptive filtering IS continual learning — LMS, APA, RLS, and the Kalman filter reveal why modern CL methods work or fail.

10 chPeng, Vidal

Human-Computer Interaction (3)

2405.06061 · CHI 2025

GPTCoach: LLM-Based Physical Activity Coaching

Can an LLM implement evidence-based health coaching? Active Choices + motivational interviewing + wearable data via tool calling.

10 chaptersJörke et al.

2510.05449 · CHI 2026

Bloom: LLM-Augmented Behavior Change

LLMs as amplifiers of evidence-based HCI interactions — coaching chatbot + goal setting + ambient displays in a 4-week RCT.

10 chaptersJörke et al.

2510.14591 · CHI 2026

Just-In-Time Objectives

Infer the user’s goal from passive observation, then optimize LLM outputs for that singular objective — 66-86% win rates over generic LLMs.

10 chaptersLam et al.

Training & RL for LLMs (2)

ICLR 2026 Honorable Mention

Kimi K2: Open Agentic Intelligence

1T MoE with MuonClip optimizer (zero loss spikes) and agentic RL for SOTA tool use.

10 chKimi Team

The Polar Express: Optimal Muon

Optimal polynomial approximations for polar decomposition — provably best convergence for the Muon optimizer.

10 chAmsel, Persson, Musco & Gower

Interpretability (3)

Transformer Circuits 2026

A Global Workspace in Language Models

The Jacobian lens reads the unspoken words a model is poised to say — the privileged sliver it reports, steers, and silently reasons with.

12 chaptersGurnee, Sofroniew, Lindsey et al.

Temporal Sparse Autoencoders

One contrastive loss term disentangles semantic from syntactic features in SAEs, leveraging the sequential nature of language.

10 chaptersBhalla, Oesterling et al.

There Will Be a Scientific Theory of Deep Learning

Learning mechanics: the emerging science of neural network training dynamics, covering solvable settings, scaling laws, µP, and universality.

10 chaptersSimon, Kunin, Atanasov et al.

Quantization & Compression (5)

2208.07339 · 2022

LLM.int8(): 8-bit Matrix Multiplication

Mixed-precision decomposition — isolate outlier features in fp16, quantize the rest to int8. Zero degradation at 175B scale.

10 chaptersDettmers et al.

2307.13304 · 2023

QuIP: 2-Bit Quantization with Guarantees

Incoherence processing via random orthogonal transforms enables provably optimal 2-bit weight quantization.

10 chaptersChee et al.

2402.04396 · 2024

QuIP#: Hadamard Incoherence & Lattice Codebooks

Randomized Hadamard transforms for incoherence plus E8 lattice vector quantization — sub-2-bit LLMs with near-lossless quality.

10 chaptersTseng et al.

QTIP: Trellis Codes for Quantization

Trellis-coded quantization with Viterbi decoding — exploiting sequential weight dependencies for extreme compression.

10 chaptersTseng et al.

AQLM: Additive Quantization for LLMs

Multi-codebook additive quantization: represent each weight group as a sum of codebook vectors for extreme 2-bit compression.

10 chaptersEgiazarian et al.

Attention & GPU Kernels (2)

ThunderKittens: Simple, Fast GPU Kernels

Tile-level DSL for GPU programming — small warp-group tiles as first-class abstractions for near-peak hardware utilization.

10 chaptersStanford

FlashAttention-4: Attention on Blackwell

Blackwell-native attention — WGMMA pipelining, TMA, and asynchronous warp specialization for 1.8 PFLOPs attention throughput.

10 chaptersDao et al.

KV Cache & Inference Efficiency (1)

SnapKV: LLM Knows What You Need

Observation-driven KV cache compression — cluster attention patterns to keep only the tokens that matter, 3.6× longer context for free.

LLM-Guided Search & AI for Science (2)

FunSearch for Competitive Programming

LLM-guided evolutionary search over program space — discovering novel algorithms for cap-set and bin-packing beyond human solutions.

10 chaptersDeepMind

Barbarians at the Gate: AI-Driven Systems Research

LLM agents as systems researchers — automated kernel optimization, data structure design, and congestion control that rival expert-crafted solutions.

10 chaptersUC Berkeley

NLP with Deep Learning (53)

2013

Word2Vec

CBOW and Skip-gram: learning word vectors from context windows.

10 chMikolov et al.

NeurIPS 2013

Word2Vec: Negative Sampling

NCE, subsampling frequent words, phrase detection.

10 chMikolov et al.

EMNLP 2014

GloVe

Co-occurrence matrix + log-bilinear model. Count meets prediction.

10 chPennington et al.

ACL 2015

It's the Tuning, Not the Model

PPMI+SVD matches Skip-gram when hyperparameters align.

8 chLevy et al.

EMNLP 2015

Evaluating Word Embeddings

Intrinsic vs extrinsic evaluation. Word intrusion coherence test.

8 chSchnabel et al.

TACL 2016

Why Word2Vec Works

Random walk model explains PMI, analogies, and low dimensionality.

8 chArora et al.

TACL 2018

Polysemy as Superposition

Polysemous word vectors are linear combinations of sense vectors.

8 chArora et al.

NeurIPS 2018

Optimal Embedding Dimensions

Bias-variance tradeoff for embedding dimension. PIP loss framework.

8 chYin & Shen

Nature 1986

Backpropagation

The paper that made backprop famous. Generalized delta rule.

10 chRumelhart et al.

Yes You Should Understand Backprop

Sigmoid saturation, dead ReLUs, gradient checking, practical debugging.

8 chKarpathy

JMLR 2011

NLP (Almost) from Scratch

Unified neural architecture for POS, NER, chunking, SRL.

8 chCollobert et al.

IEEE 1994

Vanishing Gradients in RNNs

Formal proof: gradient decays as O(λ^t). Why RNNs forget.

8 chBengio et al.

ICML 2013

Difficulty Training RNNs

Gradient clipping, the cliff phenomenon, practical training recipes.

8 chPascanu et al.

NeurIPS 2017

Attention Is All You Need

The Transformer. Self-attention, multi-head, positional encoding.

10 chVaswani et al.

Layer Normalization

Feature-wise normalization. Pre-LN vs Post-LN. RMSNorm.

8 chBa et al.

Image Transformer

Self-attention for autoregressive image generation. Local attention.

8 chParmar et al.

Music Transformer

Relative position for music. The skewing trick. Long-range structure.

8 chHuang et al.

NAACL 2019

BERT

Bidirectional pre-training via masked LM and next sentence prediction.

10 chDevlin et al.

Contextual Word Representations

From static to contextual: ELMo, GPT, BERT survey.

8 chSmith

Llama 3

Open-weight 8B-405B. GQA, SwiGLU, RoPE, 15T tokens.

10 chMeta AI

Scaling Instruction Tuning

Flan-PaLM/T5: more diverse tasks = better zero/few-shot.

8 chChung et al.

NeurIPS 2023

AlpacaFarm

Simulated human feedback for RLHF research, about 50× cheaper than crowdworkers.

8 chDubois et al.

NeurIPS 2023

How Far Can Camels Go

Open instruction tuning: data quality > quantity.

8 chWang et al.

GPT-3

175B params. Few-shot in-context learning without gradient updates.

10 chBrown et al.

NeurIPS 2022

Chain-of-Thought

Intermediate reasoning steps in the prompt. The zero-shot cue "Let's think step by step" (Kojima et al.) lifts MultiArith from 18% to 79%.

8 chWei et al.

Lottery Ticket Hypothesis

Sparse subnetworks match dense performance from original init.

8 chFrankle & Carbin

ICLR 2022

LoRA

Low-rank adaptation: ΔW = BA. 10,000x fewer trainable params.

8 chHu et al.

ICML 2019

Adapter Modules

Bottleneck adapters between frozen layers. ~3% params per task.

8 chHoulsby et al.

RAG

Retrieval-augmented generation. DPR + BART for knowledge-intensive NLP.

8 chLewis et al.

Toolformer

Self-supervised tool learning. LLM teaches itself to call APIs.

8 chSchick et al.

ICLR 2021

MMLU

57-subject multiple-choice benchmark. Knowledge breadth test.

8 chHendrycks et al.

HELM

Holistic evaluation: 42 scenarios, 7 metrics. Beyond accuracy.

8 chLiang et al.

ICLR 2023

Self-Consistency

Sample multiple CoT paths, majority vote. Diversity beats single-shot.

8 chWang et al.

DeepSeek-R1

RL-only training produces emergent reasoning. GRPO algorithm.

8 chDeepSeek

DAPO

Decoupled Clip and Dynamic sAmpling Policy Optimization. Scales RL to 32B.

8 chByteDance

Let's Verify Step by Step

Process reward models score each reasoning step, not just final answer.

8 chLightman et al.

ICML 2023

Speculative Decoding

Draft model + verify in parallel. Lossless 2-3x speedup.

8 chLeviathan et al.

Scaling Test-Time Compute

More inference compute can beat a larger model. Compute-optimal frontier.

8 chSnell et al.

2021

RoPE

Rotary position embedding. Position as rotation in 2D subspaces.

8 chSu et al.

ACL 2016

BPE for Neural MT

Byte Pair Encoding for subword segmentation. Handles OOV via decomposition.

8 chSennrich et al.

ACL 2020

XLM-RoBERTa

Cross-lingual masked LM on 100 languages. Zero-shot transfer.

8 chConneau et al.

EMNLP 2023

Tokenization Cost Across Languages

English-centric tokenizers make API costs 2-15x higher for other languages.

8 chAhia et al.

Agentic Interpretability

LLM agents automate interpretability research at scale.

8 chKim et al.

PNAS 2024

Concept Discovery in AlphaZero

Extract human-understandable chess concepts that transfer to human players.

8 chSchut et al.

AI Needs New Vocabulary

Human concepts are insufficient to describe AI representations.

8 chKim et al.

Neologism Learning

Train models to create new words for internal concepts.

8 chKim et al.

Chameleon

Mixed-modal early fusion. All modalities as unified tokens.

8 chMeta AI

Transfusion

AR text + diffusion images in one model. Hybrid objectives.

8 chMeta AI

Mixture-of-Transformers

Modality-specific experts with shared attention. Sparse routing.

8 chLiang et al.

Scaling Laws for Mixed-Modal

Power law scaling for text+image models. Optimal mixing ratios.

8 chAghajanyan et al.

CM3Leon

Autoregressive multimodal with retrieval augmentation.

8 chYu et al.

RA-CM3

Retrieval-augmented multimodal language modeling.

8 chYasunaga et al.

LMFusion

Adapting text-only LLMs for multimodal generation.

8 chShi et al.

OneFlow

Concurrent mixed-modal generation with edit flows.

8 chTay et al.

Multimodal RewardBench

Evaluating reward models for vision-language models.

8 chChen et al.

Reconstruction Alignment

Reconstruction objectives improve multimodal model alignment.

8 chLu et al.

Graph ML Papers (31)

KDD 2014

DeepWalk

Random walks as graph "sentences" + Word2Vec = node embeddings.

8 chPerozzi et al.

WWW 2015

LINE

First + second-order proximity. Edge sampling for scalability.

8 chTang et al.

KDD 2016

node2vec

Biased walks with p,q: BFS (structural equivalence) vs DFS (homophily).

10 chGrover & Leskovec

ICLR 2017

GCN

Semi-supervised classification with spectral graph convolutions.

10 chKipf & Welling

NeurIPS 2017

GraphSAGE

Sample + aggregate. Inductive. Mean/pool/LSTM aggregators.

8 chHamilton et al.

ICLR 2018

GAT

Learned attention weights. Multi-head. Inductive on PPI.

10 chVeličković et al.

ESWC 2018

R-GCN

Relation-specific weights for heterogeneous graphs + basis decomposition.

8 chSchlichtkrull et al.

KDD 2018

PinSage

3B nodes, 18B edges. Random-walk neighborhoods at Pinterest scale.

10 chYing et al.

JK-Net

Jumping knowledge: concatenate ALL layer embeddings, then aggregate.

8 chXu et al.

GIN

Maximum expressiveness = WL test. Sum + MLP is injective.

10 chXu et al.

ICML 2019

SGC

Collapse GCN layers into one linear operation. Simpler, competitive.

8 chWu et al.

Design Space of GNNs

315K configurations tested. Best GNN depends on the task.

WWW 2020

HGT

Heterogeneous Graph Transformer: type-specific Q/K/V projections.

8 chHu et al.

AAAI 2021

ID-GNN

Identity-aware coloring via ego-network extraction. Beyond GIN.

Distance Encoding

Augment nodes with distances to targets. Provably beyond 1-WL.

8 chLi et al.

NeurIPS 2021

Graphormer

Centrality + spatial + edge encodings. Won OGB-LSC 2021.

10 chYing et al.

NeurIPS 2022

GPS

MPNN + Transformer + PE hybrid. General, powerful, scalable.

8 chRampášek et al.

NeurIPS 2013

TransE

h + r ≈ t. Relations as translations. Simple, elegant, limited.

8 chBordes et al.

ICLR 2015

DistMult

Bilinear KG scoring. Symmetric by construction.

8 chYang et al.

ICML 2016

ComplEx

Complex embeddings break symmetry via conjugation.

8 chTrouillon et al.

RotatE

Relations as rotations in complex space. All relation patterns.

8 chSun et al.

SIGIR 2019

NGCF

Neural graph collaborative filtering on bipartite interaction graphs.

8 chWang et al.

SIGIR 2020

LightGCN

Simplified GCN for CF: no nonlinearities, no transforms. Better.

8 chHe et al.

Relational Deep Learning

Databases ARE graphs. GNNs replace feature engineering. RelBench.

8 chFey et al.

Polypharmacy Side Effects

Decagon: R-GCN predicts drug-drug interaction side effects.

8 chZitnik et al.

ICML 2020

Learning to Simulate Physics

Particles as nodes. GNN predicts next state. Generalizes across materials.

8 chSanchez-Gonzalez et al.

GraphRNN

Autoregressive graph generation. Two-level RNN: nodes then edges.

NeurIPS 2018

GCPN

RL-guided molecular generation. GCN policy + PPO optimization.

JT-VAE

Junction tree VAE: hierarchical molecule generation. Always valid.

8 chJin et al.

ReAct

Thought-Action-Observation loop for LLM agents.

8 chYao et al.

Reflexion

Verbal self-reflection as reinforcement learning. No weight updates.

8 chShinn et al.

Foundational Papers (8)

2015

Knowledge Distillation

Soft targets, temperature scaling, dark knowledge from teacher networks.

9 chHinton et al.

2604.00626 · 2026

On-Policy Distillation Survey

Why students must practice on their own mistakes: f-divergence framework, exposure bias O(εT²)→O(εT), and the RL equivalence.

10 chSong & Zheng

2021

Vision Transformer (ViT)

An image is worth 16x16 words: patches as tokens.

9 chDosovitskiy et al.

ICLR 2026 Outstanding

Transformers are Inherently Succinct

Transformers encode concepts exponentially more compactly than RNNs and doubly exponentially more than automata, making verification EXPSPACE-complete.

10 chaptersBergsträßer, Cotterell & Lin