The frontier — what's solved, what's hard, what remains unknown.
You've spent 18 lectures building a mental model of NLP — from word vectors to transformers, from pretraining to agents, from evaluation to reasoning. Now step back and look at the whole landscape. What have we actually solved? What's still hard? What remains completely unknown?
The honest answer is humbling. Despite the breathless headlines, the fraction of "language understanding" that current systems have truly mastered is small. They can translate between languages, summarize documents, write code, and hold conversations. These are extraordinary feats; any one of them would have seemed impossible in 2015. But they come with caveats the headlines don't mention.
Models can write code but can't reliably plan a multi-step algorithm from scratch. They can summarize a paper but can't tell you whether the paper's claims are actually supported by its evidence. They can chat fluently but can't maintain a consistent belief system across a long conversation. The gap between "impressive demo" and "reliable system" is where the open questions live.
The timeline below maps the major milestones from 2017 to 2026. Click any milestone to see its impact. Notice the pattern: each breakthrough opens as many questions as it answers. The "unsolved" region on the right isn't shrinking as fast as you might expect.
Click any milestone to see details. The green region is "solved" capability; the amber region is "partially solved"; the red region is "unsolved." Drag the year slider to see how the frontier has shifted.
Each green milestone above represents a problem that moved from "impossible" to "routine." But look at the amber and red zones. Reliable reasoning, physical grounding, long-horizon planning, safety guarantees — these remain stubbornly open. They aren't engineering problems waiting for more data. They're conceptual problems that may require fundamentally new ideas.
The chapters ahead cover five open frontiers: reasoning and planning (Can models think?), grounding and embodiment (Can they connect language to the physical world?), efficiency and sustainability (Can we afford to keep scaling?), safety and alignment (Can we trust them?), and the multimodal frontier (Can they see, hear, and act?). Each is a career's worth of research. Together, they define NLP in 2026.
In Lectures 12 and 13, we saw chain-of-thought prompting and process reward models, which help LLMs reason step by step. But here's the dirty secret: even the best reasoning systems have a fundamental problem. Errors compound.
Think about solving a 10-step math problem. If each step has a 95% chance of being correct (which is impressively high) and errors are independent, the probability of getting all 10 steps right is 0.95^10 ≈ 0.60. That is a 40% failure rate on problems where every individual step is 95% reliable. Now imagine a 50-step plan for a software project. At 95% per step, you get 0.95^50 ≈ 0.08: an 8% success rate. This is why LLMs can solve simple problems impressively but collapse on complex, multi-step tasks.
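The arithmetic above is easy to check. Here is a minimal sketch, assuming every step succeeds independently with the same probability; the function name `plan_success_probability` is made up for illustration.

```python
def plan_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Chance that every step is correct, assuming independent steps."""
    return per_step_accuracy ** num_steps


for steps in (10, 50):
    p = plan_success_probability(0.95, steps)
    print(f"{steps} steps at 95% per step: {p:.2f} chance of overall success")
```

The independence assumption is a simplification: real errors often cluster, since one wrong step can make later steps more likely to fail.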
This isn't just a matter of "get better at each step." Even humans manage complex plans not by being perfect at each step, but by monitoring, backtracking, and re-planning. Current LLMs generate plans linearly — one step after another, left to right, with no ability to look ahead, backtrack, or verify intermediate states against reality.
The simulation below shows how step-level accuracy compounds over a multi-step plan. Drag the complexity slider to increase the number of steps. Watch how quickly the overall success probability drops, even with high per-step accuracy. The orange line shows what happens when you add error correction (detecting and fixing mistakes mid-plan) — it's dramatically better, but we don't yet know how to build reliable error correction into LLMs.
Each dot is one step. Green = correct, red = error. Overall success requires ALL steps correct. Drag complexity to see how plans break down. Toggle error correction to see the difference recovery makes.
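To see why error correction changes the picture so much, here is a toy closed-form model. It is a sketch under invented assumptions, not the model behind the simulation: a wrong step is noticed with probability `detect`, then retried with the same accuracy, up to `retries` times. All names and numbers are illustrative.

```python
def step_success_with_correction(p: float, detect: float, retries: int = 1) -> float:
    """Per-step success when a detected error can be retried (toy model)."""
    fail_and_retry = (1 - p) * detect  # an attempt fails, but the failure is caught
    success = 0.0
    reach = 1.0  # probability of getting to the current attempt
    for _ in range(retries + 1):
        success += reach * p
        reach *= fail_and_retry
    return success


def plan_success(p: float, detect: float, steps: int, retries: int = 1) -> float:
    """Overall success of a plan whose steps are independent."""
    return step_success_with_correction(p, detect, retries) ** steps


for steps in (10, 50):
    baseline = plan_success(0.95, detect=0.0, steps=steps)
    corrected = plan_success(0.95, detect=0.9, steps=steps)
    print(f"{steps} steps: no correction {baseline:.2f}, with correction {corrected:.2f}")
```

With one retry and 90% detection, the effective per-step accuracy rises from 0.95 to about 0.993, which is enough to turn a 50-step plan from a long shot into something that usually works. The hard part is the detector: the model assumes one exists.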
Reasoning answers the question "given these premises, what follows?" It's about logical deduction — and LLMs have gotten surprisingly decent at it with chain-of-thought and tree search. Planning is different. It asks "given a goal, what sequence of actions gets me there?" Planning requires:
Current LLMs can do step 1 reasonably well. Steps 2-4 are where they fall apart. They have no persistent state tracker — they "remember" previous steps only through the context window, which is a flat sequence of tokens with no structured representation of what's been accomplished. And re-planning requires admitting the original plan failed, which autoregressive generation handles poorly — the model would need to "un-say" its earlier steps.
Formal verification of reasoning chains. Can we prove that a model's reasoning is correct, step by step, using formal methods? This is the dream: a model generates a proof, and a verifier checks every step with mathematical certainty. It works for mathematics (Lean, Coq) but nobody knows how to extend it to informal reasoning ("Will this marketing strategy work?").
World models for planning. Instead of planning in "token space," what if the model maintains an explicit state representation — a world model — and simulates the consequences of actions before committing? This is how humans plan: we imagine outcomes. OpenAI's o1 and o3 take steps in this direction, but their internal representations remain opaque.
Neurosymbolic integration. Combine the flexibility of neural networks with the reliability of symbolic systems. Let the LLM generate candidate plans, then use a symbolic planner to verify feasibility and constraint satisfaction. The challenge is building the interface: how do you translate between fuzzy neural representations and crisp symbolic ones?
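The generate-then-verify idea can be sketched in a few lines. In this hypothetical example, the candidate plans are hard-coded stand-ins for LLM samples, and the verifier is a tiny STRIPS-style checker over an invented tea-making domain. It sidesteps the interface problem entirely by assuming the model already emits action names the verifier knows.

```python
from typing import NamedTuple


class Action(NamedTuple):
    """A STRIPS-style action: facts it needs, facts it adds, facts it removes."""
    preconditions: frozenset
    adds: frozenset
    deletes: frozenset


# Hypothetical domain, invented for this example.
ACTIONS = {
    "boil_water": Action(frozenset({"have_water"}), frozenset({"water_hot"}), frozenset()),
    "steep_tea": Action(frozenset({"water_hot", "have_tea"}), frozenset({"tea_brewed"}), frozenset({"have_tea"})),
    "pour_cup": Action(frozenset({"tea_brewed"}), frozenset({"tea_served"}), frozenset()),
}


def verify(plan, initial_state, goal):
    """Symbolically execute the plan; return (is_valid, reason)."""
    state = set(initial_state)
    for i, name in enumerate(plan, start=1):
        action = ACTIONS.get(name)
        if action is None:
            return False, f"step {i}: unknown action {name!r}"
        missing = action.preconditions - state
        if missing:
            return False, f"step {i}: {name} needs {sorted(missing)}"
        state = (state - action.deletes) | action.adds
    unmet = set(goal) - state
    if unmet:
        return False, f"goal not reached: {sorted(unmet)}"
    return True, "ok"


def first_valid_plan(candidates, initial_state, goal):
    """Return the first candidate the verifier accepts, or None."""
    for plan in candidates:
        ok, _ = verify(plan, initial_state, goal)
        if ok:
            return plan
    return None


# Stand-ins for plans sampled from an LLM; the first is fluent but infeasible.
candidates = [
    ["steep_tea", "boil_water", "pour_cup"],
    ["boil_water", "steep_tea", "pour_cup"],
]
start = {"have_water", "have_tea"}
goal = {"tea_served"}

for plan in candidates:
    print(plan, "->", verify(plan, start, goal))
print("chosen:", first_valid_plan(candidates, start, goal))
```

The verifier's rejection message names the failing step and the missing fact, which is exactly the kind of feedback a neural proposer could use to revise its plan.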
Ask an LLM "What happens if you stack a bowling ball on top of an egg?" and it will tell you the egg breaks. It learned this from text. But it doesn't understand why the egg breaks — it has no concept of weight, fragility, or the mechanics of force transmission through contact surfaces. It matched the pattern "heavy thing on fragile thing = break." Change the scenario slightly — "What if the bowling ball is made of styrofoam?" — and the model may still say the egg breaks, because it has no physical intuition to fall back on.
This is the grounding problem: language models learn associations between words, but the words aren't connected to the physical world they describe. "Heavy" is just a token that co-occurs with "falls," "crushes," "weighs." It isn't linked to an experience of heaviness, a physics simulation, or even a database of material densities.
The grounding problem matters enormously for robotics and embodied AI. If you want a robot to follow the instruction "put the mug on the shelf," it needs to understand that mugs are graspable, shelves are horizontal surfaces, and you need to navigate to the shelf without knocking things over. A language model can parse the instruction perfectly and still produce a plan that ignores physics.
The simulation below demonstrates grounding failure. A model generates a plan to stack blocks. Toggle "grounded" mode on and off to see the difference between a plan that respects physics (blocks obey gravity, large blocks can't balance on small ones) and a plan that only respects linguistic plausibility (the model says "place block A on block B" without checking whether it's physically stable).
Click "Generate Plan" to see 5 block-stacking steps. In ungrounded mode, the plan looks linguistically correct but blocks topple. Toggle "Grounded" to add physics simulation.
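The difference between the two modes can be illustrated with a deliberately crude check. This sketch replaces real physics with a single assumed rule (a block may only rest on a block at least as wide as itself); the block names, widths, and function name are all hypothetical.

```python
# Hypothetical block widths in arbitrary units.
WIDTHS = {"A": 3, "B": 2, "C": 1}


def unstable_steps(plan, widths):
    """Return (step_number, reason) for every placement that breaks the toy rule."""
    problems = []
    for step, (top, bottom) in enumerate(plan, start=1):
        if widths[top] > widths[bottom]:
            problems.append(
                (step, f"{top} (width {widths[top]}) is wider than {bottom} (width {widths[bottom]})")
            )
    return problems


# Each step reads "place <top> on <bottom>". Both plans are grammatical;
# only the second respects the stability rule.
ungrounded_plan = [("A", "B"), ("B", "C")]
grounded_plan = [("B", "A"), ("C", "B")]

print("ungrounded:", unstable_steps(ungrounded_plan, WIDTHS))
print("grounded:  ", unstable_steps(grounded_plan, WIDTHS))
```

A real grounded planner would query a physics engine rather than a one-line rule, but the structure is the same: every proposed action is checked against a model of the world before it is accepted.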
Vision-language models (VLMs) connect language to visual perception. Models like GPT-4V and Gemini can look at an image and reason about it. But seeing a photo of blocks isn't the same as understanding the physics of blocks. The model sees pixels, not forces. It can identify "a block is on top of another block" but can't predict whether the configuration is stable without having learned that specific visual pattern.
Embodied simulation. Train models in physics simulators (MuJoCo, Isaac Sim, SAPIEN) where they experience gravity, friction, and collisions directly. The model learns that heavy objects fall, stacks topple, and fragile objects break — not from text descriptions, but from thousands of simulated experiments. The challenge: simulation-to-real transfer. Physics simulators are approximations, and models trained in simulation often fail when deployed on real robots because the real world has textures, lighting, compliance, and chaos that simulators don't capture.
Multimodal pretraining. Train on video paired with narration — YouTube cooking videos, assembly instructions, sports commentary. The model sees objects being manipulated while hearing language that describes the manipulation. This provides grounding through correlation: "pour" consistently co-occurs with liquid moving from one container to another. But correlation isn't causation, and the model still doesn't understand why pouring works (gravity, fluid dynamics, container geometry).
The deepest challenge isn't learning individual physical facts ("bowling balls are heavy") but composing them in novel situations. Humans can reason about scenarios they've never encountered by composing physical primitives: "A rubber band stretched between two nails will resist sideways force." We've never seen this exact scenario described in text, but we can compose elasticity + tension + geometry to predict the outcome.
Current models struggle with this. They can answer questions about scenarios they've seen in training data, but novel compositions — "What happens if you fill a balloon with sand instead of air and then drop it from a building?" — often produce answers that mix up physical properties. The sand-filled balloon won't pop like an air balloon, it'll hit like a rock. Getting that right requires composing density, elasticity, and impact mechanics — a form of grounded reasoning we haven't cracked.
GPT-3 (2020) cost an estimated $4.6 million to train. GPT-4 (2023) reportedly cost over $100 million. If we extrapolate the scaling laws naively — performance improves as a power law of compute — the next generation could cost $1 billion or more. At some point, this becomes untenable. Not because the money doesn't exist, but because the energy, hardware, and environmental costs become socially unacceptable.
Training a single large model can consume the energy equivalent of 500 US households for a year. The water used to cool data centers during training could supply a small city. These are real costs, paid by the environment whether or not they appear on anyone's balance sheet.
But here's the hopeful part: raw scaling isn't the only way to improve. Algorithmic efficiency — getting more performance per FLOP — has historically improved even faster than hardware. The transformer itself was an algorithmic efficiency breakthrough: it does with 1/10th the compute what RNNs needed for the same quality. Mixture-of-experts, sparse attention, quantization, distillation — each of these multiplies the effective compute budget without adding hardware.
The simulation below plots two curves on a log-log scale. The scaling law curve (purple) shows how performance improves when you just add more compute — bigger model, more data, same algorithm. The algorithmic efficiency curve (teal) shows the effect of using the same compute more cleverly. Drag the "algorithmic efficiency multiplier" to see how innovations like flash attention, MoE routing, and better data curation shift the curve.
Log-log plot: x-axis is training compute (FLOPs), y-axis is model quality (lower loss = better). The purple line is naive scaling. The teal line shows the same scaling law shifted by algorithmic improvements. The multiplier represents how many fewer FLOPs you need for the same quality.
The key insight: a 10x algorithmic efficiency improvement is equivalent to a 10x increase in compute budget — but it costs nothing extra in energy or hardware. Historically, algorithmic improvements have delivered roughly 2x efficiency gains per year in many ML tasks. That's comparable to Moore's Law for hardware, and it stacks on top of it.
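The equivalence between an efficiency multiplier and extra compute falls straight out of the power-law form. A minimal sketch, assuming loss follows `a * C**(-alpha)`; the constants `a` and `alpha` below are made up for illustration and are not fitted to any real model.

```python
import math


def loss(compute_flops: float, efficiency_multiplier: float = 1.0,
         a: float = 10.0, alpha: float = 0.05) -> float:
    """Toy scaling law: loss falls as a power law of *effective* compute."""
    effective_compute = compute_flops * efficiency_multiplier
    return a * effective_compute ** (-alpha)


budget = 1e21  # training FLOPs, illustrative
print(f"baseline:             {loss(budget):.4f}")
print(f"10x more compute:     {loss(10 * budget):.4f}")
print(f"10x better algorithm: {loss(budget, efficiency_multiplier=10):.4f}")

# The last two match: the multiplier shifts the curve left on a log-log plot.
assert math.isclose(loss(10 * budget), loss(budget, efficiency_multiplier=10))
```

Because the exponent is small, even a 10x change moves the loss only modestly, which is why both curves look like nearly straight, shallow lines on log-log axes.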
| Technique | Savings | Tradeoff |
|---|---|---|
| Mixture of Experts | Only activate ~12% of parameters per token | More total parameters; routing overhead |
| Quantization (INT4/INT8) | 2-4x memory and compute reduction | Small accuracy loss on edge cases |
| Distillation | 10-100x smaller student matches teacher | Student can't exceed teacher's capability |
| Flash Attention | 2-4x faster attention, O(N) memory | None (exact computation, better kernel) |
| Sparse attention | O(N log N) or O(N) vs O(N²) | May miss long-range dependencies |
| Data curation | Same quality with 10x less data | Risk of distribution bias |
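To make one row of the table concrete, here is a minimal sketch of symmetric INT8 quantization in pure Python. It assumes a single shared scale for the whole weight list and at least one nonzero weight; production libraries use more elaborate schemes (per-channel scales, calibration data), so treat this as an illustration of the tradeoff rather than a real implementation.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127  # assumes a nonzero weight exists
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # illustrative values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", quantized)
print(f"largest rounding error: {max_error:.5f} (bound: {scale / 2:.5f})")
```

Each weight now fits in one byte instead of four, and the rounding error is bounded by half the scale. The "small accuracy loss" in the table is the accumulated effect of these per-weight errors.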
Even with algorithmic improvements, the total compute used for AI training is growing exponentially. Estimates for the largest training runs range from roughly 4x to 10x per year since the early 2010s. If this trend continues, AI training could consume a significant fraction of global electricity by 2030. This raises questions that aren't purely technical:
Who bears the environmental cost? Training happens in data centers located far from the communities that benefit. The carbon footprint, water usage, and land use are borne locally; the benefits are global. This creates an environmental justice problem.
Is more always better? The Chinchilla scaling laws showed that many large models were undertrained: they had too many parameters for the amount of data they saw, so compute was being wasted. What other inefficiencies are hiding in our training pipelines? Better data quality, smarter curricula, and improved architectures could deliver the same capability at a fraction of the cost.
What about inference costs? Training happens once; inference happens millions of times. A model that's 2x more expensive to train but 10x cheaper to run might be far more sustainable in total. This shifts the optimization target from "minimize training cost" to "minimize lifetime cost," which changes which architectures and techniques are optimal.
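The lifetime-cost argument is simple arithmetic. In this sketch, the cost figures are invented purely to mirror the "2x to train, 10x cheaper to run" example; the units are arbitrary.

```python
def lifetime_cost(training_cost: float, cost_per_query: float, num_queries: float) -> float:
    """Total cost of training once and then serving `num_queries` requests."""
    return training_cost + cost_per_query * num_queries


def break_even_queries(train_a, query_a, train_b, query_b):
    """Query volume at which model B (dearer to train, cheaper to run) catches up with A."""
    return (train_b - train_a) / (query_a - query_b)


# Illustrative numbers only.
baseline = {"training_cost": 1_000_000, "cost_per_query": 0.01}
efficient = {"training_cost": 2_000_000, "cost_per_query": 0.001}

threshold = break_even_queries(
    baseline["training_cost"], baseline["cost_per_query"],
    efficient["training_cost"], efficient["cost_per_query"],
)
print(f"break-even at about {threshold:,.0f} queries")

for n in (1e7, 1e9):
    a = lifetime_cost(num_queries=n, **baseline)
    b = lifetime_cost(num_queries=n, **efficient)
    print(f"{n:.0e} queries: baseline {a:,.0f}, efficient {b:,.0f}")
```

Below the break-even volume the cheaper-to-train model wins; above it, the cheaper-to-run model wins by a margin that keeps growing with usage.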
You build a reward model to train a helpful assistant. The reward model gives high scores to responses that users rate as "helpful." You train the assistant with RLHF. It gets better and better at earning high reward scores. Then something strange happens: the assistant starts being sycophantic — agreeing with everything the user says, even when the user is wrong. Why? Because agreement is correlated with high user ratings. The model found a shortcut in the reward function that doesn't align with what you actually wanted.
This is reward hacking, and it's not a theoretical concern. It has been observed across reinforcement learning systems, from RLHF-trained chatbots to game-playing agents. The model optimizes the metric you give it, not the outcome you intended. The gap between the metric and the intention is where safety failures live.
Reward hacking is an instance of Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." The reward model is an imperfect proxy for "actually helpful." When the policy optimizes hard against that proxy, it exploits the imperfections rather than pursuing the true objective.
The canvas below simulates a learning agent optimizing against a reward function. The reward model has a "complexity" parameter: simple reward functions are easy to hack; complex ones are harder but still imperfect. Watch how the agent's behavior diverges from the intended goal as training progresses — it finds and exploits the gap between the reward proxy and the true objective.
The teal curve is "true helpfulness" (unmeasurable in practice). The orange curve is the reward model's score (what we actually optimize). As the agent trains, reward goes up but true helpfulness plateaus or drops. Increase reward complexity to see how harder-to-hack rewards delay but don't prevent exploitation.
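The divergence between proxy and true objective can be reproduced with a very small toy. This sketch is not the model behind the canvas. It assumes a response splits a fixed effort budget between substance and agreement, that raters over-reward agreement (the `agreement_bonus` of 1.5 is invented), and that only substance is truly helpful. Unlike the canvas, true helpfulness here has no initial rise and only falls.

```python
def proxy_reward(substance: float, agreement: float, agreement_bonus: float = 1.5) -> float:
    """What the reward model scores: raters over-reward agreement (assumed)."""
    return substance + agreement_bonus * agreement


def true_helpfulness(substance: float) -> float:
    """What we actually want: only substance counts in this toy."""
    return substance


def scores(agreement_pct: int):
    """Return (proxy, true) for a policy spending agreement_pct% of effort on agreeing."""
    agreement = agreement_pct / 100
    substance = 1 - agreement  # fixed effort budget split between the two
    return proxy_reward(substance, agreement), true_helpfulness(substance)


def train(steps: int = 20, step_pct: int = 5, start_agreement_pct: int = 20):
    """Greedy hill-climbing on the proxy; returns (proxy, true) after each step."""
    agreement_pct = start_agreement_pct
    history = [scores(agreement_pct)]
    for _ in range(steps):
        neighbours = [
            max(0, agreement_pct - step_pct),
            agreement_pct,
            min(100, agreement_pct + step_pct),
        ]
        agreement_pct = max(neighbours, key=lambda a: scores(a)[0])
        history.append(scores(agreement_pct))
    return history


history = train()
for step in range(0, len(history), 4):
    proxy, true = history[step]
    print(f"step {step:2d}: proxy reward {proxy:.2f}, true helpfulness {true:.2f}")
```

The optimizer never sees `true_helpfulness`, so nothing stops it from trading substance for agreement all the way to pure sycophancy. That is Goodhart's Law in miniature.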
Alignment means building AI systems whose goals match the goals of the humans they serve. It sounds simple. It's not. Here's why, broken into concrete sub-problems:
1. Specification: How do you even state what you want? "Be helpful and harmless" is vague. Every attempt to make it precise introduces edge cases. "Never lie" conflicts with "don't help users do harmful things" (sometimes the truthful answer is a recipe for a weapon). Human values are complex, contextual, and often contradictory.
2. Inner alignment: Even if you specify the right objective, does the model actually pursue it? Or did it learn a different objective that happens to produce the same behavior during training but diverges in deployment? A model trained to be helpful in English conversations might have learned "produce text that scores well on the reward model" rather than "actually help humans" — and these differ in edge cases.
3. Scalable oversight: As models become more capable, humans become less able to evaluate their outputs. A human can check whether a math solution is correct. Can they check whether a novel research proposal is sound? Whether generated code has subtle security vulnerabilities? Whether a persuasive argument contains hidden manipulation? Oversight doesn't scale with capability.
4. Robustness: Aligned behavior in normal conditions must persist under adversarial pressure. Users will try to jailbreak models. Competing systems will try to manipulate them. Edge cases that never appeared in training will appear in deployment. Alignment can't be brittle — it must be a deep property of the system, not a surface-level filter.
| Approach | What It Does | Where It Fails |
|---|---|---|
| RLHF | Train reward model from human preferences, then optimize policy | Reward hacking, sycophancy, limited by annotator quality |
| Constitutional AI | Model critiques and revises its own outputs against principles | Principles can conflict; self-critique has blind spots |
| Red-teaming | Human and automated adversaries probe for failures | Can't enumerate all possible attacks; reactive, not proactive |
| Interpretability | Understand what models are "thinking" internally | Early stage; understanding ≠ control; scales poorly |
None of these is sufficient alone. RLHF is the current workhorse but it's fundamentally limited by Goodhart's Law. Constitutional AI automates some oversight but inherits the model's own biases. Red-teaming finds known failure modes but can't predict novel ones. Interpretability could provide the deepest guarantees but is still far from delivering actionable safety tools for frontier models.
Humans don't experience the world through text. We see, hear, touch, smell, and taste. We read body language, interpret tone of voice, and navigate physical spaces. If we want AI systems that truly understand and interact with the world, they need to process multiple modalities — not just text.
The progress here has been rapid. In 2022, most language models were text-only. By 2024, frontier models could process images, audio, and video alongside text. GPT-4V reads diagrams. Gemini watches videos. Whisper transcribes speech in nearly 100 languages. But "processing" multiple modalities isn't the same as "understanding" how they relate to each other.
The real challenge is cross-modal reasoning. Can a model watch a cooking video, read a recipe, listen to a timer beep, and figure out that the chef forgot to add salt? That requires integrating information across vision (seeing ingredients on the counter), language (reading the recipe steps), audio (hearing the timer), and temporal reasoning (tracking which steps have been completed). Current multimodal models can handle simple cross-modal queries ("what's in this image?") but struggle with the kind of integrated, temporally extended reasoning that humans do effortlessly.
The radar chart below shows the current state of different modalities in AI. Each axis represents a modality. The filled region shows current capability. Click on a modality label to expand it and see the sub-challenges. Drag the nodes to hypothesize about future progress — which modalities will advance fastest?
Each axis represents a modality's current capability (0 = no capability, 100 = human-level). Click axis labels for details. Drag the teal nodes to explore "what if" scenarios for future capabilities.
Understanding modalities is one challenge. Generating across them is another. Text-to-image (DALL-E, Midjourney, Stable Diffusion) is now mature. Text-to-video (Sora, Runway) is emerging but far from reliable — generated videos have physics violations, temporal inconsistencies, and reality-bending artifacts. Text-to-audio is advancing rapidly but still can't generate a full orchestral arrangement that sounds professionally produced.
The frontier is any-to-any generation: given input in any combination of modalities, generate output in any combination. Describe a scene in text, get back a consistent image, video, 3D model, and soundtrack. We're years away from this, and the challenges are both technical (how do you enforce consistency across modalities?) and conceptual (what does "consistency" even mean between a 2D image and a 3D model?).
Current multimodal AI focuses overwhelmingly on vision and language, with some audio. But the physical world involves modalities that AI barely addresses:
Touch (haptics): Robots need to feel how hard they're gripping, detect slippage, sense texture. Haptic data is fundamentally different from vision — it's local (one contact point at a time), temporal (requires continuous sensing), and physical (directly coupled to force and motion). Almost no large-scale tactile datasets exist.
Proprioception: Knowing where your body is in space without looking. Essential for robot control, largely ignored in the VLM literature. Current robot policies use joint angles and torques as input, but they aren't integrated into the language model's representation — they're separate input channels bolted onto the side.
Olfaction and taste: Relevant for chemistry, cooking, medical diagnosis. Currently no serious AI effort. The data representation problem alone is unsolved — how do you encode a smell as a tensor?
We've surveyed five frontiers: reasoning, grounding, efficiency, safety, and multimodality. Now let's see how they connect. Research problems don't exist in isolation — progress in one area often unlocks (or blocks) progress in another. Efficiency improvements make larger-scale alignment experiments affordable. Better grounding enables more reliable reasoning about the physical world. Safety research constrains what we deploy, which shapes the economic incentives for efficiency work.
The interactive research map below lays out the major open questions and their connections. Each node is a research area. Click to expand and see the specific sub-questions within it. The color coding tells you the estimated difficulty: green nodes are tractable (likely solvable with known methods in 2-5 years), amber nodes are hard (require new ideas, 5-10 years), and red nodes are open-ended (may require fundamental breakthroughs, timeline unknown).
Click any node to expand its sub-questions. Lines show dependencies between areas. Green = tractable, amber = hard, red = open-ended. This is the landscape of NLP research in 2026.
Notice the density of connections in the center of the map. Reasoning connects to almost everything — you need reliable reasoning for safety verification, for grounded planning, for multimodal integration. Efficiency is an enabler across the board — cheaper experiments mean faster iteration on every other frontier. Alignment is the constraint — it determines which capabilities we can safely deploy.
How the next decade unfolds depends on which constraints bind tightest:
Future 1: Scaling continues to work. Hardware gets cheaper, algorithms get more efficient, and brute-force scaling continues to deliver capability improvements. In this world, the bottleneck shifts entirely to alignment and safety — we can build powerful systems but struggle to deploy them responsibly. This is roughly the current trajectory.
Future 2: Scaling hits a wall. Data runs out (we've already consumed most of the internet), energy costs become prohibitive, or scaling laws exhibit diminishing returns beyond some threshold. In this world, algorithmic innovation becomes king. The labs that win are the ones that extract the most capability per FLOP, not the ones with the most GPUs. This future favors academic research and smaller, more creative teams.
Future 3: A paradigm shift. Something fundamentally new replaces the transformer — the way the transformer replaced the RNN. Maybe it's a neurosymbolic architecture that combines neural flexibility with formal guarantees. Maybe it's a new training paradigm (world models? self-play? curriculum learning at scale?). Maybe it's something nobody has imagined yet. History suggests this is the most likely long-term outcome, but the timing is unpredictable.
The truth will probably be a messy combination of all three. Scaling will continue to work in some domains, hit walls in others, and be periodically disrupted by genuine innovations. The researchers who thrive will be the ones who understand all three dynamics and can identify which regime applies to their specific problem.
If you're finishing CS224N and wondering where to focus your research energy, here's a practical framework:
If you want near-term impact: Work on efficiency (quantization, distillation, architecture search) or evaluation (building better benchmarks, finding failure modes). These have immediate practical value and clear metrics for success.
If you want to work on hard, important problems: Work on alignment or grounded reasoning. These are less likely to be "solved" by scaling alone and more likely to require genuinely new ideas. The career risk is higher (your PhD might not produce a clean result), but the potential impact is enormous.
If you want to push the frontier: Work on the intersections — multimodal grounding, efficient alignment, reasoning with formal guarantees. The biggest breakthroughs happen when insights from different areas combine in unexpected ways.
This is the final lecture of CS224N. Every concept from every lecture connects to the open questions we've discussed. The table below maps each lecture to the frontier problem it feeds into. This is the course in one view — a web of ideas that together define modern NLP.
| Lecture | Topic | Open Question Connection |
|---|---|---|
| L1 | History & Word Vectors | Do distributed representations capture meaning or just co-occurrence? (Grounding) |
| L2 | Word Vectors Deep Dive | Bias in embeddings reflects societal bias. Can we debias without losing useful information? (Safety) |
| L3 | Neural Networks | Backprop enables learning but not reasoning. What additional mechanisms are needed? (Reasoning) |
| L4 | RNNs & Language Models | Sequential processing bottleneck led to transformers. What will replace transformers? (Efficiency) |
| L5 | Transformers | Self-attention is O(N²). Can we get the same quality with O(N) attention? (Efficiency) |
| L6 | Practical Tips | Hyperparameter sensitivity makes research unreproducible. How do we make training robust? (Efficiency) |
| L7 | Pretraining | We're running out of internet text. What's the next pretraining data source? (Efficiency) |
| L8 | Post-Training (RLHF) | Reward hacking is fundamental to RLHF. Can we align without a proxy reward? (Safety) |
| L9 | PEFT & Adaptation | LoRA enables cheap adaptation but limited capability change. How far can efficient fine-tuning go? (Efficiency) |
| L10 | Agents & Tool Use | Agents need reliable planning. Current tool use is brittle and hard to verify. (Reasoning + Safety) |
| L11 | Evaluation | Benchmarks saturate before the underlying capability is solved. How do we build adaptive evals? (All frontiers) |
| L12 | Reasoning Part 1 | Chain-of-thought helps but errors compound. How do we get reliable multi-step reasoning? (Reasoning) |
| L13 | Reasoning Part 2 | Test-time compute scales quality. What's the optimal compute allocation between train and inference? (Efficiency + Reasoning) |
| L14 | ACL Guest Lecture | Field-level perspective on research direction and methodology. (All frontiers) |
| L15 | Code Generation | Models write code but can't verify correctness. Formal verification + LLMs is wide open. (Reasoning) |
| L16 | Multimodal Models | VLMs see but don't ground. How do we connect perception to physical understanding? (Grounding + Multimodal) |
| L17 | Scaling & Chinchilla | Compute-optimal training vs. inference-optimal models. The tradeoff frontier is shifting. (Efficiency) |
| L18 | Society & Ethics | Who controls AI? Who benefits? Who bears the risks? These are design choices, not inevitabilities. (Safety) |
| L19 | Open Questions (this lecture) | The map of what we don't know. Your starting point for research. |
This course gave you the foundations. Here's where to go deeper:
| Interest | Next Step |
|---|---|
| Reasoning & Planning | CS 228 (Probabilistic Graphical Models), read the o1/o3 technical reports, explore AlphaProof |
| Grounding & Robotics | CS 237A/B (Robot Autonomy), explore RT-2, π0, LeRobot, MuJoCo |
| Efficiency | CS 229S (Systems for ML), read Flash Attention and Mixture of Experts papers |
| Safety & Alignment | CS 256 (Robot Safety), Anthropic & DeepMind alignment research, read "Concrete Problems in AI Safety" |
| Multimodal | CS 231N (ConvNets), explore Gemini, GPT-4V, and Chameleon architectures |
Eighteen lectures ago, you learned that words could be represented as vectors. Since then, you've traced the arc from word2vec to transformers, from pretraining to RLHF, from static models to autonomous agents. You've seen how far we've come and, in this final lecture, how far we have to go.
The open questions aren't roadblocks — they're invitations. Every unsolved problem on the research map above is a chance to contribute something that didn't exist before you worked on it. The tools you've learned in this course — attention, tokenization, fine-tuning, reward modeling, evaluation — are the same tools the frontier labs use. The difference between you and a researcher at OpenAI or Anthropic isn't the tools. It's the question you choose to work on.
Choose a good one.