Research papers rebuilt as interactive experiences. Every equation derived, every method visualized, every result interactive.
The foundation model — flow matching for continuous robot actions across 7 robot types.
First VLA to clean kitchens in new homes via heterogeneous co-training.
Train fast, run fast, generalize better — decoupling VLM knowledge from action learning.
The first VLA that improves from its own real-world experience via RL.
Efficient real-time chunking via training-time action conditioning.
Making action chunking flow policies run in real-time.
5-10x fewer action tokens via DCT-based compression.
Giving VLAs memory across multiple time scales for long-horizon tasks.
How internet video of humans performing tasks transfers to robot control.
Fine-tuning VLAs with RL by treating reward as a special token.
Diversified prompts unlock emergent dexterity, cross-embodiment transfer, and compositional generalization.
TRI’s open-source framework unifying the full pipeline. Qwen3-VL backbone beats LBM by 23 pp on bimanual manipulation.
Predict robot actions as plain text — no new tokens, no action heads, no architectural changes. Surprisingly SOTA on LIBERO.
Three desiderata for action tokens: compression, decodability, ordering. Nested dropout + FSQ = anytime coarse-to-fine robot actions.
Learning as data compression. Kolmogorov complexity, crude vs refined MDL.
Read-Process-Write for set inputs; how input and output ordering affects chain-rule factorization.
Micro-batch splitting, bubble time analysis, scaling to 557M params.
Skip connections, degradation problem, 152-layer networks.
Exponential receptive fields without pooling for dense prediction.
MPNN unifying framework for graph neural networks.
THE Transformer paper — self-attention, multi-head, positional encoding.
Original attention mechanism for neural machine translation.
Pre-activation BN-ReLU, 1001-layer networks.
Pairwise relational reasoning for visual QA.
VAE posterior collapse, bits-back coding, information flow control.
Attention over memory slots, memory-memory interactions.
Rise and fall of complexity in closed systems.
External memory, differentiable addressing, copy/sort tasks.
End-to-end speech recognition, CTC loss, English + Mandarin.
Power law scaling over 7 orders of magnitude.
Visuomotor policy learning via action diffusion — 46.9% average improvement across 15 tasks.
Low-cost bimanual manipulation via CVAE-based action chunking — 80-90% on fine tasks from 10 min of demos.
The foundational policy gradient — statistical gradient-following via the log-derivative trick.
The theorem that enabled actor-critic methods — policy gradients with function approximation.
The workhorse of modern RL — clipped surrogate objective for stable, efficient policy learning.
The exponentially weighted advantage estimator that controls the bias-variance tradeoff. Used by PPO and most modern actor-critic methods.
CNN + Q-learning + experience replay — the paper that launched deep RL from raw pixels.
One-line fix to Q-learning's overestimation bias — decouple selection from evaluation.
Learn the full return distribution, not just the mean: Wasserstein contraction and a 51-atom categorical distribution.
Offline RL without querying OOD actions — expectile regression on the value function for implicit policy improvement.
Learn reward from pairwise human comparisons — the paper that launched RLHF and made ChatGPT possible.
Your language model is secretly a reward model — skip the RL loop and align with a simple classification loss.
DPO projects rewards onto a low-dimensional manifold — causing preference reversal and reward degradation. AuxDPO fixes it with nullspace degrees of freedom.
Your language model is secretly a preference classifier — extract P("Yes")/P("No") to self-judge and self-improve via DPO.
Short model rollouts branched from real data — 10x sample efficiency of model-free methods with matching asymptotic performance.
Failed episode? Pretend the achieved state was the goal — sparse rewards solved via implicit curriculum.
Learn the RL algorithm itself — encode it in an RNN's weights via meta-learning across tasks.
Break meta-RL's chicken-and-egg problem by separating exploration and exploitation objectives.
LLMs know what to do, robots know what they can do — multiply the two for grounded planning.
Domain randomization + accurate actuator models enable RL policies to transfer from sim to real robots.
In-the-wild robot teaching without in-the-wild robots — handheld gripper demos transfer to any robot via relative actions.
The bottleneck isn't better models — it's better data. A unified framework for exploration across SL and RL, and the path to open-ended learning.
The benchmark that shaped offline RL. Maze2D, AntMaze, Adroit, Kitchen — diverse datasets revealing algorithm deficiencies.
Feedforward single-image to 1.2M 3D Gaussians. Depth Pro backbone, two-layer depth, self-supervised finetuning. 1000x faster than diffusion.
Can foundation models build spatial beliefs through active exploration? Cognitive map probing reveals belief instability and inertia.
The computational structure of Spatial AI — co-designing SLAM, deep learning, and specialized processors for always-on spatial intelligence.
Local message passing on factor graphs for distributed Spatial AI — matching algorithms to graph processors.
Endowing vision-language models with quantitative spatial reasoning via synthetic 3D training data.
Deterministic geometric environments as zero-noise oracles for self-evolving 3D spatial reasoning.
Robustly lifting open-world 2D detections into metric 3D bounding boxes with uncertainty.
Object-centric localization and mapping — replacing points with 3D boxes for un-posed indoor SfM.
Scaling indoor 3D detection: 440K exhaustive 3D boxes + a fully transformer detector that beats point methods.
Camera pose estimation as diffusion denoising with geometry-guided epipolar sampling.
Fully differentiable Structure from Motion — deep tracking, joint cameras, differentiable BA.
Zero-shot metric depth from a single image via canonical camera space transformation.
Geometric foundation model for joint zero-shot metric depth and surface normal estimation.
Unleashing Stable Diffusion priors for joint depth and normal estimation from single images.
Joint point tracking in video — exploiting inter-track correlations with transformer attention.
Pseudo-labeling real videos with cycle consistency for better point tracking.
Streaming 3D reconstruction with geometric context attention — anchor, window, and trajectory memory.
Pragmatic VLA foundation model — 20K hours of real dual-arm data, scaling study, GM-100 benchmark.
Masked depth modeling — treating RGB-D sensor failures as natural masks for self-supervised depth completion.
O-Voxel sparse representation + 16× compression VAE + 4B flow-matching model for high-quality 3D asset generation.
Causal world modeling for robot control — autoregressive diffusion with MoT and closed-loop rollout.
First latent CoT to beat explicit CoT — dual-modal decoders (language + visual world model) compress reasoning, then prefill matches answer-only speed.
The neural network IS the computer — unifying compute, memory, and I/O in learned runtime state. Video models that roll out CLI and GUI frames.
First open benchmark evaluating world models by embodied task success, not visual quality. Controllability beats aesthetics.
120 kitchen scenes, 2500+ AI-generated objects, 100 LLM-designed tasks, MimicGen 80× data amplification. Clear sim-to-real scaling trend.
Decompose AI stacks into 5 typed primitives, optimize jointly via LLM-guided spec search. On-device specs within 3.2 pp of cloud at 800× lower cost.
Contrastive trajectory analysis identifies missing capabilities, synthesizes targeted training environments, trains per-capability LoRA adapters via GRPO. +14.1 on τ²-Bench.
Unified framework for executable, verifiable, stateful agent systems — code as the operational substrate for reasoning, acting, memory, and coordination.
The design space of AI agent systems — permission, compaction, extensibility, delegation, persistence.
End-to-end optimization of model harnesses — a coding agent searches over harness code using full filesystem history.
Live benchmark for generative research synthesis — no system exceeds 31% geometric mean across knowledge, retrieval, and verifiability.
First comprehensive survey of LLM agent evaluation — planning, tool use, web/SWE agents, generalist benchmarks.
Statistical rigor for LLM evaluation — clustered CIs, paired tests, power analysis.
Converting research papers into MCP-based AI agents — auto-generated tools, test-driven refinement.
Dual-control agent evaluation — both agent and user use tools in a shared Dec-POMDP environment.
RL from execution feedback — teaching code LLMs to read compiler errors and fix iteratively.
MCTS for LM agents — reasoning + acting + planning unified. 92.7% HumanEval.
As-needed recursive task decomposition — try first, decompose on failure. +28.3% ALFWorld.
Self-improving agents — evolve agent code with empirical validation, quality-diversity archive.
Fully automated scientific discovery — ideate, code, experiment, write paper, review. <$15/paper.
Workshop-level discovery via agentic tree search — first AI-authored paper accepted at ICLR workshop.
Evolutionary coding agent — first Strassen improvement in 56 years, Google infrastructure optimization.
Predict a downstream paper’s core insight from its parent papers via RL with similarity rewards. 34% improvement over gemini-3-pro.
39% performance drop across 15 LLMs when tasks are delivered gradually across turns. Sharded simulation decomposes degradation into aptitude loss and reliability collapse.
Predict by minimizing energy — 35% faster scaling than Transformer++, 29% better System 2 thinking from unsupervised learning alone.
Natural language reflection on execution traces outperforms GRPO RL with 35× fewer rollouts.
Benchmark for browsing agents — accuracy scales smoothly with test-time compute.
Inference scaling via repeated sampling — coverage follows exponentiated power law across 4 orders of magnitude.
Architecture search over inference-time technique combinations — outperforms o1 by 15.1%.
Optimal test-time compute beats 14× larger model. Search vs revision, difficulty-dependent allocation.
Heavy-tailed p-distributions transform per-problem exponential scaling into aggregate power laws.
Combine weak verifiers via weak supervision — Llama 70B + Weaver = o3-mini accuracy.
Process reward models (PRM800K) outperform outcome supervision — 78% MATH with per-step verification.
Automatic process rewards via completion sampling — no human labels needed for step-level supervision.
Adaptive branching tree search — dynamically balance width (diversity) vs depth (refinement).
Agentic search during reasoning — detect uncertainty, retrieve, reason-in-documents, inject into CoT.
1T-param MoE with MuonClip optimizer and agentic RL — SOTA open-source on SWE-Bench and agentic tasks.
671B MoE — MLA (57× KV compression), aux-loss-free balancing, FP8 training, $5.57M total cost.
RLAIF — harmlessness from AI feedback via written principles. Critique, revise, train.
Bootstrapping reasoning — generate rationales, filter correct, rationalize wrong, fine-tune, repeat.
GRPO — group-relative advantages without critic. 51.7% MATH from 7B model with 120B math tokens.
Open-source LLM RL at scale: decoupled clipping, dynamic sampling. 50 points on AIME 2024 with Qwen2.5-32B.
Multi-step RL for reasoning + tool use — decompose trajectories into sub-trajectories, DPO on each.
Seedless agentic framework — taxonomy-driven coverage, double-critic quality, calibrated Elo complexity scoring at scale.
Set RL with polychromic objectives — train LLMs to explore diverse reasoning strategies via reward×diversity synergy. Up to 20% pass@k gains.
Competition-level code generation — 1M samples, behavioral clustering, top 54.3% Codeforces.
Serial × parallel test-time scaling for SWE-bench — 57.4% at $2,292 with amortized context.
Can LLMs write efficient GPU kernels? 250 PyTorch workloads, frontier models <20% at fast1.0.
All-pairs correlation volume + iterative GRU refinement — the dominant optical flow method.
Temporal action detection via local self-attention on multiscale feature pyramid — no proposals needed.
Scaling masked video autoencoders to 1B+ params with dual masking and progressive training.
6B video foundation model — three-stage training, SOTA on 60+ benchmarks.
State space models for video — linear complexity alternative to quadratic attention for long videos.
Depth matters — stacking 3×3 convolutions to 16-19 layers for ImageNet SOTA.
Multi-scale parallel convolutions in Inception modules — 22 layers, only 5M params.
From classification to dense per-pixel prediction — the birth of deep semantic segmentation.
CNN features for object detection — selective search + AlexNet features + SVM classification.
RoI pooling + multi-task loss: process the image once, 213× faster than R-CNN at test time.
Region Proposal Networks with anchors — learned proposals replace selective search.
Single-shot detection as regression — 45 fps real-time object detection.
End-to-end detection with transformers — Hungarian matching, no NMS, no anchors.
Self-supervised ViTs with emergent segmentation via self-distillation.
All-purpose visual features without supervision — rivaling CLIP without text.
Plain ViT + two deconv layers = SOTA pose estimation. No fancy modules, no domain-specific tricks. Simplicity wins.
Weight-sharing NAS for real-time detection — train once, search thousands of sub-architectures for free. First real-time detector above 60 AP on COCO.
Real-time radiance field rendering via differentiable Gaussian splatting.
Transformer backbone for diffusion with LLM-style scaling laws.
Three-stage video diffusion — data curation matters more than architecture.
Simple MLP connector + good data beats complex VLM engineering.
6B vision encoder with progressive LLM alignment for universal multimodal tasks.
Expert-level multimodal benchmark — GPT-4V gets 56.8%, humans get 88.6%.
Geometric 3D vision made easy — predict pointmaps directly from image pairs.
Monocular depth foundation model via self-training on 62M unlabeled images.
Generative interactive environments from unlabeled video — ICML Best Paper.
Rectified flow + MMDiT for scalable text-to-image synthesis.
Next-scale prediction — autoregressive image generation with LLM-style scaling. Best Paper.
Synthetic precision + real diversity for state-of-the-art monocular depth.
Joint 3D reconstruction + dense matching — grounding image matching in 3D.
Segment anything in images AND videos — streaming memory attention.
Expert transformer for video generation with 3D VAE compression.
Native dynamic resolution VLM with 2D-RoPE — any image, any aspect ratio.
Real-time dense SLAM powered by learned 3D reconstruction priors.
One transformer, one pass — all 3D geometry from unposed images. CVPR Best Paper.
One plain transformer recovers the visual space from any views — depth + rays, no architectural specialization.
SLAM + detection + segmentation + projection — monocular video to semantic floorplans.
Rebuilding the semantic mapping pipeline with 2025 foundation models — zero fine-tuning, open vocabulary, no calibration.
Unified detect + segment + track for open-vocabulary concepts — text or exemplar prompts find ALL instances. Doubles prior accuracy.
Instruction-tune an image generator to output decodable RGB — beats SAM 3 on segmentation and Depth Anything 3 on depth with one model.
0.4B–5B ViTs pretrained on 1B human images with unified MAE + contrastive learning. SOTA on pose, segmentation, normals, pointmaps, albedo.
The full landscape of learning without forgetting — 7 scenarios, 5 method families, unified Bayesian framework.
The illusion of depth — neural nets as nested multi-level optimization with associative memories at every frequency.
Adaptive filtering IS continual learning — LMS, APA, RLS, and the Kalman filter reveal why modern CL methods work or fail.
Can an LLM implement evidence-based health coaching? Active Choices + motivational interviewing + wearable data via tool calling.
LLMs as amplifiers of evidence-based HCI interactions — coaching chatbot + goal setting + ambient displays in a 4-week RCT.
Infer the user’s goal from passive observation, then optimize LLM outputs for that singular objective — 66-86% win rates over generic LLMs.
One contrastive loss term disentangles semantic from syntactic features in SAEs — leveraging the sequential nature of language.
Learning mechanics: the emerging science of neural network training dynamics — solvable settings, scaling laws, µP, universality.
Mixed-precision decomposition — isolate outlier features in fp16, quantize the rest to int8. Zero degradation at 175B scale.
Incoherence processing via random orthogonal transforms enables provably optimal 2-bit weight quantization.
Randomized Hadamard transforms for incoherence plus E8 lattice vector quantization — sub-2-bit LLMs with near-lossless quality.
Trellis-coded quantization with Viterbi decoding — exploiting sequential weight dependencies for extreme compression.
Multi-codebook additive quantization — represent each weight group as a sum of codebook vectors for extreme 2-bit compression.
Tile-level DSL for GPU programming — small warp-group tiles as first-class abstractions for near-peak hardware utilization.
Blackwell-native attention — WGMMA pipelining, TMA, and asynchronous warp specialization for 1.8 PFLOPs attention throughput.
LLM-guided evolutionary search over program space — discovering novel algorithms for cap-set and bin-packing beyond human solutions.
LLM agents as systems researchers — automated kernel optimization, data structure design, and congestion control that rival expert-crafted solutions.
CBOW and Skip-gram: learning word vectors from context windows.
Negative sampling (a simplified NCE), subsampling frequent words, phrase detection.
Co-occurrence matrix + log-bilinear model. Count meets prediction.
PPMI+SVD matches Skip-gram when hyperparameters align.
Intrinsic vs extrinsic evaluation. Word intrusion coherence test.
Random walk model explains PMI, analogies, and low dimensionality.
Polysemous word vectors are linear combinations of sense vectors.
Bias-variance tradeoff for embedding dimension. PIP loss framework.
The paper that made backprop famous. Generalized delta rule.
Sigmoid saturation, dead ReLUs, gradient checking, practical debugging.
Unified neural architecture for POS, NER, chunking, SRL.
Formal proof: gradient decays as O(λ^t). Why RNNs forget.
Gradient clipping, the cliff phenomenon, practical training recipes.
The Transformer. Self-attention, multi-head, positional encoding.
Feature-wise normalization. Pre-LN vs Post-LN. RMSNorm.
Self-attention for autoregressive image generation. Local attention.
Relative position for music. The skewing trick. Long-range structure.
Bidirectional pre-training via masked LM and next sentence prediction.
From static to contextual: ELMo, GPT, BERT survey.
Open-weight 8B-405B. GQA, SwiGLU, RoPE, 15T tokens.
Flan-PaLM/T5: more diverse tasks = better zero/few-shot.
Simulated human feedback for RLHF research at 1/100th cost.
Open instruction tuning: data quality > quantity.
175B params. Few-shot in-context learning without gradient updates.
"Let's think step by step." 18% → 79% on math tasks.
Sparse subnetworks match dense performance from original init.
Low-rank adaptation: ΔW = BA. 10,000x fewer trainable params.
Bottleneck adapters between frozen layers. ~3% params per task.
Retrieval-augmented generation. DPR + BART for knowledge-intensive NLP.
Self-supervised tool learning. LLM teaches itself to call APIs.
57-subject multiple-choice benchmark. Knowledge breadth test.
Holistic evaluation: 42 scenarios, 7 metrics. Beyond accuracy.
Sample multiple CoT paths, majority vote. Diversity beats single-shot.
RL-only training produces emergent reasoning. GRPO algorithm.
Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO). Scales RL to 32B.
Process reward models score each reasoning step, not just final answer.
Draft model + verify in parallel. Lossless 2-3x speedup.
More inference compute can beat a larger model. Compute-optimal frontier.
Rotary position embedding. Position as rotation in 2D subspaces.
Byte Pair Encoding for subword segmentation. Handles OOV via decomposition.
Cross-lingual masked LM on 100 languages. Zero-shot transfer.
English-centric tokenizers make API costs 2-15x higher for other languages.
LLM agents automate interpretability research at scale.
Extract human-understandable chess concepts that transfer to human players.
Human concepts are insufficient to describe AI representations.
Train models to create new words for internal concepts.
Mixed-modal early fusion. All modalities as unified tokens.
AR text + diffusion images in one model. Hybrid objectives.
Modality-specific experts with shared attention. Sparse routing.
Power law scaling for text+image models. Optimal mixing ratios.
Autoregressive multimodal model with retrieval augmentation.
Retrieval-augmented multimodal language modeling.
Adapting text-only LLMs for multimodal generation.
Concurrent mixed-modal generation with edit flows.
Evaluating reward models for vision-language models.
Reconstruction objectives improve multimodal model alignment.
Random walks as graph "sentences" + Word2Vec = node embeddings.
First + second-order proximity. Edge sampling for scalability.
Biased walks with p, q: BFS (structural equivalence) vs DFS (homophily).
Semi-supervised classification with spectral graph convolutions.
Sample + aggregate. Inductive. Mean/pool/LSTM aggregators.
Learned attention weights. Multi-head. Inductive on PPI.
Relation-specific weights for heterogeneous graphs + basis decomposition.
3B nodes, 18B edges. Random-walk neighborhoods at Pinterest scale.
Jumping knowledge: concatenate ALL layer embeddings, then aggregate.
Maximum expressiveness = WL test. Sum + MLP is injective.
Collapse GCN layers into one linear operation. Simpler, competitive.
315K configurations tested. Best GNN depends on the task.
Heterogeneous Graph Transformer: type-specific Q/K/V projections.
Identity-aware coloring via ego-network extraction. Beyond GIN.
Augment nodes with distances to targets. Provably beyond 1-WL.
Centrality + spatial + edge encodings. Won OGB-LSC 2021.
MPNN + Transformer + PE hybrid. General, powerful, scalable.
h + r ≈ t. Relations as translations. Simple, elegant, limited.
Bilinear KG scoring. Symmetric by construction.
Complex embeddings break symmetry via conjugation.
Relations as rotations in complex space. Models symmetry, antisymmetry, inversion, and composition.
Neural graph collaborative filtering on bipartite interaction graphs.
Simplified GCN for CF: no nonlinearities, no transforms. Better.
Databases ARE graphs. GNNs replace feature engineering. RelBench.
Decagon: R-GCN predicts drug-drug interaction side effects.
Particles as nodes. GNN predicts next state. Generalizes across materials.
Autoregressive graph generation. Two-level RNN: nodes then edges.
RL-guided molecular generation. GCN policy + PPO optimization.
Junction tree VAE: hierarchical molecule generation. Always valid.
Thought-Action-Observation loop for LLM agents.
Verbal self-reflection as reinforcement learning. No weight updates.
Soft targets, temperature scaling, dark knowledge from teacher networks.
Why students must practice on their own mistakes: f-divergence framework, exposure bias O(εT²)→O(εT), and the RL equivalence.
An image is worth 16x16 words — patches as tokens.
Transformers encode concepts exponentially more compactly than RNNs and doubly exponentially more than automata — making verification EXPSPACE-complete.
Exponential-trapezoidal discretization, complex-valued states, and MIMO — three SSM-principled upgrades that push the Pareto frontier of quality vs. inference speed.
K independent MemoryBlocks inject token identity at every layer, amplifying rare-token gradients K-fold and bypassing contextual collapse.
RL-trained skill curator learns when to insert, update, and delete reusable Markdown skills. 8B curator beats Gemini-2.5-Pro.
Power-law scaling for knowledge distillation — optimal teacher-student size ratios, capacity gaps, and when distillation beats pre-training compute-for-compute.