Zhuge, Zhao, Liu, Zhou et al. — Meta AI & KAUST, 2026

Neural Computers

A neural network that IS the computer — unifying computation, memory, and I/O in a single learned runtime state. Video models that roll out terminal and desktop frames from instructions and actions.

Prerequisites: Video generation basics + diffusion models + computer architecture concepts
10 Chapters · 5+ Simulations

Chapter 0: The Problem

Right now, every neural network in the world runs on a computer. PyTorch allocates tensors in CUDA memory. The GPU executes matrix multiplications. The operating system schedules threads and manages I/O. The neural network is a guest — it lives inside a host machine that does the real computing.

But what if you flipped that relationship? What if the neural network was the computer?

Not a metaphor. Not an analogy. A literal computer where the CPU, the RAM, the screen, the keyboard — everything — lives inside a single neural network's forward pass. You type a command, the network processes it, and it renders the next frame of what the screen should look like. No operating system. No file system. No instruction set architecture. Just a learned function that maps inputs and state to visual output.

Think about what that means. When you open a terminal on your laptop and type ls, a chain of specialized hardware and software handles the request: the keyboard controller sends a scan code, the OS kernel processes the interrupt, the shell parses the command, the file system reads the directory, and the display server renders the text. A Neural Computer does all of that in one step: it takes "the user typed ls" as input and generates the next video frame showing the directory listing. Every intermediate step is implicit in the neural network's weights.

The provocation: A conventional computer separates computation (CPU), memory (RAM/disk), and I/O (screen/keyboard) into distinct hardware. A Neural Computer collapses all three into one object: the latent state of a video model. The model's hidden activations are the memory. The diffusion process is the computation. The decoded frames are the display.

This is the core thesis of Zhuge, Zhao, Liu, Zhou et al. from Meta AI and KAUST. They don't just theorize about it — they build two working prototypes. One generates terminal sessions as video. The other generates interactive desktop environments as video, responding to mouse clicks and keyboard input in real time.

Neither prototype is a complete replacement for your laptop. Not even close. But they prove something remarkable: a video diffusion model can learn to simulate a computing interface well enough to render readable text, respond to user actions, and maintain short-term state across frames.

Why this matters beyond novelty

If you can train a model to simulate a computer, you get something conventional software can't offer: graceful degradation. A real computer crashes on invalid input. A Neural Computer produces its best approximation. A real computer can only run programs someone wrote. A Neural Computer can improvise — rendering interfaces that were never explicitly programmed, just learned from data.

The paper also reveals a deep tension: the model is an excellent renderer but a poor reasoner. It can draw a terminal that looks perfect but gets the math wrong. This gap between perception and computation is the central finding, and it shapes everything that follows.

What we'll cover: The formal framework (state update + render loop), two prototypes (terminal and desktop), four action injection architectures with quantitative comparison, findings on symbolic accuracy, data quality, and cursor supervision, and the roadmap toward a Completely Neural Computer.
In a Neural Computer, where does "memory" live?

Chapter 1: The Key Insight

Think about what a conventional computer actually does at the highest level. Every clock cycle, it: (1) reads the current state from memory, (2) takes an input (instruction or I/O event), (3) computes a new state, and (4) produces an output (updated screen, network packet, etc.). That's it. That's the entire abstraction.

Now think about what a video generation model does. Every timestep, it: (1) reads the current latent representation, (2) takes a conditioning signal (text prompt, previous frame, user action), (3) produces a new latent representation via the diffusion process, and (4) decodes that latent into a visible frame.

The structural parallel is exact. State update = diffusion step. Memory = latent activations. Input = conditioning signal. Output = decoded frame. The paper's key insight is that this isn't a loose analogy — it's a formal equivalence. Any system that maintains state, accepts input, and produces output is a computer in the computational-theoretic sense.

A conventional computer stores state across three separate physical substrates. The CPU registers hold immediate values. The RAM holds working data. The disk holds persistent data. Each has different access speeds, different capacities, different interfaces. This separation is an engineering decision, not a theoretical requirement.

A Neural Computer unifies all of this into a single object: the runtime latent state h_t. This is the hidden representation inside the video model at timestep t. It encodes everything the model "knows" about the current state of the simulated computer: what's on screen, what was typed before, what should happen next.

The unification in concrete terms

Let's be precise about what "unification" means with actual tensor shapes. In the paper's NCGUIWorld prototype:

The "memory" isn't a metaphor. Those 768 tokens × 1152 dimensions = ~885K continuous values carry everything the model knows about the current state of the simulated computer.
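As a quick check of that arithmetic, here is a minimal sketch. The token count and width come from the text above; the 2-bytes-per-value figure is an assumption (fp16 storage) that the source does not state.

```python
# Size of the runtime latent state quoted above: 768 tokens x 1152 dimensions.
NUM_TOKENS = 768
TOKEN_DIM = 1152

num_values = NUM_TOKENS * TOKEN_DIM

# Assumption (not from the paper): each value is stored as a 16-bit float.
BYTES_PER_VALUE = 2
size_mib = num_values * BYTES_PER_VALUE / (1024 ** 2)

print(f"{num_values:,} continuous values")
print(f"{size_mib} MiB if stored as fp16")
```

Under that assumption the whole "RAM" of the simulated computer fits in under 2 MiB, which gives a sense of how compressed the state is.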

The unification has three consequences the paper explores:

What is the key structural parallel between a video model and a computer?

Chapter 2: The Formalism

The paper formalizes a Neural Computer with two equations. These are deceptively simple, but they encode the entire framework.

h_t = F_θ(h_{t-1}, x_t, u_t)

This is the state update. At each timestep:

x_{t+1} ~ G_θ(h_t)

This is the render step. The decoder (a VAE decoder in practice) takes the new latent state and produces the next visible frame. The tilde (~) means "sampled from" because the diffusion process is stochastic — there's randomness in the generation.

How this maps to actual architecture: In the paper's prototypes, F_θ is the full DiT denoising stack. The "latent state" h_t is literally the video latents z_t flowing through the transformer blocks. G_θ is the VAE decoder that converts latents back to pixel space. The action u_t enters via text encoding (T5 for prompts) or action injection modules (for mouse/keyboard). The observation x_t enters via the VAE encoder applied to the current frame.
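The two equations read naturally as a loop. Here is a toy sketch of that loop, not the paper's model: the names state_update, render, and rollout are made up for illustration, the "state" is just a tuple of typed keys, and sampling is mimicked by a randomly chosen cursor glyph.

```python
import random

def state_update(h_prev, x_t, u_t):
    """h_t = F(h_{t-1}, x_t, u_t). Toy version: the state is a tuple of typed keys."""
    # x_t (the observed frame) is accepted for fidelity to the equation,
    # but this toy F ignores it.
    return h_prev + (u_t,)

def render(h_t, rng):
    """x_{t+1} ~ G(h_t). The random cursor glyph stands in for sampling noise."""
    return "$ " + "".join(h_t) + rng.choice(["_", "|"])

def rollout(actions, seed=0):
    """Alternate state update and render, one action per timestep."""
    rng = random.Random(seed)
    h, x = (), "$ _"
    frames = []
    for u in actions:
        h = state_update(h, x, u)
        x = render(h, rng)
        frames.append(x)
    return frames

frames = rollout(["l", "s"])
for frame in frames:
    print(frame)
```

The point of the sketch is the shape of the loop: the state is the only thing carried between steps, and the frame is a sample drawn from it.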

Mapping to concrete architecture

In the Wan2.1 backbone used by NCCLIGen, the data flow is:

  1. VAE encoder compresses the input frame x_t from pixel space (H×W×3) to latent space (h×w×c). This compression is lossy, so fine details can be lost, which is one reason character accuracy (54%) lags behind frame-level visual quality (40.77 dB PSNR).
  2. CLIP extracts a global visual embedding from x_t, a "what kind of image is this" vector that conditions the generation at a semantic level.
  3. T5 encodes the text prompt u_t into a sequence of token embeddings. For NCCLIGen, this is the terminal command description.
  4. Concatenation: VAE latents + CLIP features + T5 embeddings + noise ε are assembled into the input to the DiT. This concatenation is the "instruction fetch" of the Neural Computer.
  5. DiT denoising: 30-50 iterative steps of the diffusion process. Each step refines the noisy latent toward a clean prediction. This is the "execution" phase — the bulk of computation happens here.
  6. VAE decoder converts the denoised latent back to pixel space. This is the "display update."

Compare this to a conventional computer's fetch-decode-execute cycle:

Classical Computer | Neural Computer
Fetch instruction from memory | Encode current frame x_t + action u_t
Decode opcode | Conditioning injection into DiT
Execute ALU operation | DiT denoising (F_θ)
Write result to registers/RAM | New latent state h_t
Update display buffer | VAE decode to frame (G_θ)

The critical difference: in a classical computer, each step is a discrete, deterministic operation on bits. In a Neural Computer, the entire cycle is a single differentiable forward pass through a neural network, operating on continuous-valued tensors.

Why video models specifically? A video model is the natural choice because it already operates as a sequential state machine. Each frame depends on all previous frames through the temporal attention in the DiT. The latent state propagates information forward in time, exactly like registers carry values between clock cycles. Image models can't do this — they generate frames independently with no temporal state.

The stochasticity problem

There's a subtle issue in the render equation x_{t+1} ~ G_θ(h_t). The tilde means the output is sampled, not deterministically computed. Run the same model with the same latent state twice, and you'll get different frames (because the diffusion process uses random noise). For a Neural Computer, this is a bug, not a feature: computers should be deterministic. The paper addresses this under behavior consistency, one of the four requirements for a Completely Neural Computer (Chapter 7).

In practice, you can mitigate this by fixing the random seed during inference, but that's a workaround, not a solution. True determinism would require either a deterministic sampling method (like DDIM with zero noise) or a training objective that penalizes output variance for identical inputs.
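To make the two options concrete, here is a toy sketch. It is not a real diffusion sampler: sample and run are hypothetical names, the "denoising" is just a move halfway toward a target value, and eta plays the role of the per-step noise scale (eta = 0 corresponds to the zero-noise DDIM setting mentioned above).

```python
import random

def sample(state, init, eta, rng, steps=4):
    """Toy iterative sampler: pull x toward `state`, adding noise scaled by eta."""
    x = init
    for _ in range(steps):
        x = 0.5 * (x + state)            # "denoise": move halfway to the target
        x += eta * rng.gauss(0.0, 0.1)   # stochastic term, vanishes when eta == 0
    return x

def run(seed, eta):
    """Draw the initial latent and the per-step noise from one seeded generator."""
    rng = random.Random(seed)
    init = rng.gauss(0.0, 1.0)
    return sample(1.0, init, eta, rng)

# Workaround: a fixed seed reproduces the same output.
print(run(7, eta=1.0) == run(7, eta=1.0))

# A different seed gives a different output for the same state.
print(run(7, eta=1.0) == run(8, eta=1.0))

# eta = 0 removes per-step noise, but the initial latent must also be fixed.
a = sample(1.0, init=0.0, eta=0.0, rng=random.Random(1))
b = sample(1.0, init=0.0, eta=0.0, rng=random.Random(2))
print(a, b)
```

Note the last case: zero per-step noise alone is not enough, because the initial latent is still a random draw unless you pin it.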

In the formalism h_t = F_θ(h_{t-1}, x_t, u_t), what role does the DiT play?

Chapter 3: NCCLIGen — CLI as Video

The first prototype turns a text terminal into a video generation problem. You give it a text prompt describing a terminal command and an image of the first frame. The model generates the subsequent frames showing what the terminal output looks like.

This sounds deceptively simple. It is not. Terminal output is dense text rendered at 13px font size. Every character must be pixel-perfect or the output is unreadable garbage. This is the hardest possible rendering task for a video model — no blurry backgrounds to hide behind.

Why terminals are the hardest test

You might think a terminal — monochrome background, fixed-width text — would be easy for a generative model. It's the opposite. In a natural image, a slightly wrong pixel is invisible. In a terminal, a single wrong pixel turns an 'a' into an 'o', or a '1' into an 'l'. There's no redundancy. Every pixel carries information. This is why the paper starts with terminals: if the NC can render readable terminal text, it can render anything.

The evaluation reflects this: the paper uses OCR (optical character recognition) on generated frames to measure exact character accuracy, not just visual similarity. This is a much harder bar than PSNR alone.

Architecture

NCCLIGen is built on Wan2.1 I2V (image-to-video), a recent open-source video generation model. The pipeline has five stages:

  1. VAE encodes the first terminal frame into latent space
  2. CLIP extracts visual features from the first frame (used as a global conditioning signal)
  3. T5 encoder processes the text prompt into token embeddings
  4. These three signals — latent, CLIP features, text embeddings — are concatenated with diffusion noise and fed into the DiT stack
  5. The DiT denoises iteratively, and the VAE decoder produces the output video frames

Data pipeline

The paper builds two datasets from asciinema recordings (recorded terminal sessions):

A critical finding on captions: The paper tests two captioning strategies. Semantic captions describe what's happening ("installing Python packages"). Detailed captions specify the literal text content ("running pip install numpy, output shows version 1.24.0"). Detailed captions massively outperform: 26.89 dB vs 21.90 dB PSNR. Terminals need literal specification — "installing packages" is too vague to render exact text.
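To get a feel for what a 5 dB gap means, here is a small sketch using the standard PSNR definition (10 · log10(MAX² / MSE)). The paper's exact evaluation protocol is not described here, so treat the helper names and the 8-bit pixel range as assumptions.

```python
import math

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio(psnr_high, psnr_low):
    """How many times smaller the mean squared error is at the higher PSNR."""
    return 10.0 ** ((psnr_high - psnr_low) / 10.0)

# Detailed captions (26.89 dB) vs semantic captions (21.90 dB).
ratio = mse_ratio(26.89, 21.90)
print(f"MSE is about {ratio:.2f}x lower with detailed captions")
```

Because PSNR is logarithmic, a gap of roughly 5 dB corresponds to a bit more than a threefold reduction in mean squared pixel error.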

Training and results

Training cost: ~15,000 H100 GPU-hours for the General dataset, ~7,000 for the Clean dataset. The key metrics at 13px font rendering:

Metric | Value | What it means
PSNR | 40.77 dB | Very high: near pixel-perfect frame quality
SSIM | 0.989 | Structural similarity: layout and text positions are correct
Char accuracy | 0.54 | 54% of individual characters are exactly right (OCR'd)
Line accuracy | 0.31 | 31% of full lines are character-perfect

Notice the gap: PSNR/SSIM are excellent (the frames look right) but character accuracy is only 54%. The model renders text that is visually convincing but not always symbolically correct. This tension — visual fidelity vs. symbolic accuracy — is a recurring theme.

What 0.54 character accuracy actually looks like: Imagine a terminal line that should read numpy==1.24.0. A typical error renders it as numpy==1.24.O (capital O instead of zero) or numpv==1.24.0 (v instead of y). One such slip in this 13-character line still scores 92% character accuracy; at 54%, errors like these hit nearly half the characters. The text is plausible enough to fool a human glancing at the screen, wrong enough to break any program that parses it. This is the uncanny valley of text rendering.
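Here is a minimal sketch of how character and line accuracy could be scored once OCR has turned a frame back into text. The position-by-position comparison is an assumption (the paper's exact metric definition is not given here), and the second package line is an invented example.

```python
def char_accuracy(reference, ocr_text):
    """Fraction of reference characters matched at the same position."""
    if not reference:
        return 1.0
    matches = sum(r == o for r, o in zip(reference, ocr_text))
    return matches / len(reference)

def line_accuracy(reference_lines, ocr_lines):
    """Fraction of lines that match the reference exactly."""
    exact = sum(r == o for r, o in zip(reference_lines, ocr_lines))
    return exact / len(reference_lines)

reference = ["numpy==1.24.0", "pandas==2.0.3"]   # second line is illustrative
ocr_output = ["numpy==1.24.O", "pandas==2.0.3"]  # capital O instead of zero

print(char_accuracy(reference[0], ocr_output[0]))
print(line_accuracy(reference, ocr_output))
```

A single wrong character barely moves character accuracy but zeroes out the whole line, which is why line accuracy (0.31) sits well below character accuracy (0.54).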

Training cost perspective

15,000 H100 GPU-hours is substantial but not extraordinary by 2026 standards. For reference, training Llama 2 70B took ~1.7M A100 GPU-hours. NCCLIGen is roughly 100× cheaper in raw GPU-hours, but it's also doing a fundamentally simpler task (rendering terminal frames, not understanding language). The ~7,000 hours for CLIGen(Clean) suggests that with better data curation, training can be significantly more efficient.

Why do detailed captions dramatically outperform semantic captions for terminal rendering?

Chapter 4: NCGUIWorld — Interactive Desktop

NCCLIGen is impressive but non-interactive — you specify the command upfront and watch the video play out. NCGUIWorld goes further: it generates a full graphical desktop environment that responds to mouse clicks and keyboard input in real time.

The target: Ubuntu 22.04 with XFCE4 desktop, 1024×768 resolution, 15 FPS. The model must learn to render windows, menus, cursors, text, and icons — and update them correctly in response to user actions.

Data sources

The paper collects training data from three sources, and the comparison between them reveals a crucial insight about data quality:

Source | Hours | Method | FVD ↓
Random Slow | ~1,000 | Random mouse/keyboard at slow pace | 48.17
Random Fast | ~400 | Random mouse/keyboard at fast pace | 20.37
Claude CUA | ~110 | Claude Computer Use Agent performing real tasks | 14.72

110 hours of purposeful agent data beats 1,400 hours of random data. Claude CUA data has an FVD of 14.72 versus 20.37 for Random Fast and 48.17 for Random Slow. The Claude CUA set is roughly 13× smaller than the two random sets combined, yet its FVD is about 1.4× lower than Random Fast and 3.3× lower than Random Slow. Why? Because Claude performs purposeful actions: opening apps, clicking menus, typing commands. Random actions produce incoherent trajectories that teach the model little about the causal structure of a desktop environment.

Cursor supervision: a microcosm of the NC challenge

One surprisingly hard sub-problem: getting the model to render the cursor in the right position. This is a microcosm of the entire Neural Computer challenge — how do you ground abstract spatial information (coordinates) into pixel-level rendering?

The cursor is a small visual element (typically 12×19 pixels) that must appear at a precise location determined by the user's mouse input. In a real computer, this is trivial: the OS composites the cursor sprite at (x,y). In an NC, the model must learn that the action embedding "mouse at (542, 387)" means drawing a cursor at a specific pixel location. This is surprisingly difficult because the mapping from coordinate space to pixel space is arbitrary — it depends on the screen resolution, the VAE downsampling factor, and the cursor style.

The paper tries three progressively better supervision strategies:

Strategy | Cursor accuracy | How it works
Position-only | 8.7% | Action embedding includes (x,y) coordinates
Position + Fourier | 13.5% | Fourier-encode coordinates for finer spatial signal
SVG mask/reference | 98.7% | Render cursor as an SVG overlay, give model a reference frame with cursor already drawn

The jump from 13.5% to 98.7% is remarkable. Spatial coordinates alone are too abstract for the model to reliably ground into pixel space. But showing the model a reference frame with the cursor already rendered at the correct position? That's a visual signal the model can copy directly. Pixel-level supervision dominates abstract supervision.
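For readers who haven't seen Fourier encoding of coordinates, here is a minimal sketch of the general technique. The number of bands and the doubling frequency schedule are assumptions chosen for illustration; the paper's actual encoding is not specified here.

```python
import math

def fourier_encode(value, num_bands=4):
    """Encode a value in [0, 1] as sin/cos features at doubling frequencies."""
    features = []
    for k in range(num_bands):
        freq = (2 ** k) * math.pi
        features.append(math.sin(freq * value))
        features.append(math.cos(freq * value))
    return features

def encode_cursor(x, y, width=1024, height=768, num_bands=4):
    """Normalize pixel coordinates, then Fourier-encode each axis."""
    return fourier_encode(x / width, num_bands) + fourier_encode(y / height, num_bands)

features = encode_cursor(542, 387)
print(len(features))   # 2 axes x 4 bands x (sin, cos) = 16 features
```

The higher-frequency bands change quickly with small cursor movements, which is the "finer spatial signal". It is still an abstract vector, though, and the results above suggest that is the real limitation.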

Meta-action vs raw-action

Actions can be injected as raw events (mouse_move(x,y), key_press('a'), click(left)) or as meta-actions (API-like commands: type_text("hello"), click_element("OK button")). Meta-actions slightly outperform: SSIM 0.863 vs 0.847. The structured, higher-level representation is easier for the model to condition on.

Think of it this way: Raw actions are like assembly language — low-level, precise, hard to learn patterns from. Meta-actions are like a high-level API — they carry more semantic information per token. When you say click_element("OK button"), the model knows both the action (click) and the target (a button labeled OK). When you say click(542, 387), the model has to figure out what's at those coordinates by itself.
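One way to picture the two representations is as data types. This is an illustrative sketch only: the class names and the expand_type_text helper are hypothetical, not the paper's action schema.

```python
from dataclasses import dataclass
from typing import List, Union

# Raw events: low-level, one per physical input.
@dataclass(frozen=True)
class MouseMove:
    x: int
    y: int

@dataclass(frozen=True)
class Click:
    button: str

@dataclass(frozen=True)
class KeyPress:
    key: str

# Meta-action: one API-like command carrying the intent directly.
@dataclass(frozen=True)
class TypeText:
    text: str

RawAction = Union[MouseMove, Click, KeyPress]

def expand_type_text(action: TypeText) -> List[RawAction]:
    """Lower one meta-action into the raw key presses it stands for."""
    return [KeyPress(ch) for ch in action.text]

raw_events = expand_type_text(TypeText("hello"))
print(len(raw_events), "raw events for 1 meta-action")
```

A single type_text("hello") carries the same information as five separate key presses, so the model has fewer conditioning steps to integrate.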

Resolution and frame rate

NCGUIWorld targets 1024×768 at 15 FPS. This is deliberately conservative: modern desktops run at 1920×1080 or higher at 60 FPS. The lower resolution reduces the latent space size (fewer visual tokens), and the lower frame rate means fewer frames to generate per second. Even at these reduced settings, each frame requires a full diffusion inference pass. Real-time interaction requires either much faster models or clever scheduling (e.g., predicting multiple frames ahead while the user hasn't acted).
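A back-of-envelope latency budget shows why this is hard. Two assumptions here are mine, not the paper's: that the 30-50 denoising steps quoted for the Wan2.1 backbone in Chapter 2 also apply to NCGUIWorld, and that each pass produces exactly one frame (real video models often generate several frames per pass).

```python
def frame_budget_ms(fps):
    """Wall-clock time available per frame at a given frame rate."""
    return 1000.0 / fps

def per_step_budget_ms(fps, denoise_steps):
    """Time available per denoising step if every frame needs a full pass."""
    return frame_budget_ms(fps) / denoise_steps

print(f"{frame_budget_ms(15):.1f} ms per frame at 15 FPS")
for steps in (30, 50):
    print(f"{steps} steps: {per_step_budget_ms(15, steps):.2f} ms per step")
```

Under these assumptions each full pass through the DiT would have to finish in roughly 1 to 2 ms, which is why faster samplers or multi-frame prediction come up as necessities.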

Why does 110 hours of Claude CUA data beat 1,400 hours of random interaction data?

Chapter 5: Action Injection Modes

The most detailed engineering contribution in the paper is the comparison of four ways to inject user actions into the video generation backbone. This is the central design question for any interactive Neural Computer: where and how does user input enter the diffusion process?

All four modes take the same input (action features) and target the same backbone (the DiT transformer blocks). They differ in where the action signal is injected — before, alongside, around, or inside the transformer.

Mode 1: External

Action features modulate the VAE latents before they enter the transformer. Think of it as adjusting the input signal before it hits the processing pipeline. The action is encoded into a feature vector, which is added to or element-wise multiplied with the initial latent zt. After this modulation, the transformer processes the modified latents normally — it never directly "sees" the action, only its residual effect on the input.

Result: SSIM 0.746, FVD 33.4. The worst performer. By the time action information propagates through 30+ transformer layers, it's been diluted beyond recognition. The early layers overwrite the subtle modulation with their own learned representations.

Mode 2: Contextual

Action tokens are concatenated with visual tokens in the self-attention sequence. If the visual sequence has 768 tokens and the action has 16 tokens, the self-attention now operates on 784 tokens. The transformer sees both visual patches and action tokens, and self-attention lets them interact freely. This is analogous to how text conditioning works in many text-to-image diffusion models.

Result: SSIM 0.824, FVD 21.2. Better: the action information is persistent across all layers because it's part of the token sequence. But it competes for attention bandwidth with the much larger set of visual tokens. In a 768-visual + 16-action sequence, the action tokens together make up only ~2% of the positions any query can attend to.
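The token bookkeeping for Contextual mode can be sketched in a few lines. The tokens here are placeholder strings rather than embeddings, and build_sequence is a hypothetical helper; only the counts (768 visual, 16 action) come from the text.

```python
NUM_VISUAL = 768
NUM_ACTION = 16

def build_sequence(visual_tokens, action_tokens):
    """Contextual injection: one joint sequence for self-attention."""
    return visual_tokens + action_tokens

visual = [f"v{i}" for i in range(NUM_VISUAL)]
action = [f"a{i}" for i in range(NUM_ACTION)]
sequence = build_sequence(visual, action)

# Share of sequence positions occupied by action tokens.
action_share = NUM_ACTION / len(sequence)

# Self-attention cost grows with the square of the sequence length.
attention_pairs = len(sequence) ** 2

print(len(sequence), f"{action_share:.1%}", attention_pairs)
```

If attention were spread uniformly, the action tokens as a group would receive about 2% of the attention mass. The model can learn to attend to them more, but it has to learn that against a 48-to-1 imbalance.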

Mode 3: Residual

ControlNet-style architecture. A parallel branch of transformer blocks processes action features separately, and their outputs are added as residual connections to the main backbone. The main backbone's weights are frozen; only the residual branch is trained. This follows the same pattern ControlNet uses to add spatial conditioning (edges, depth maps) to Stable Diffusion.

Result: SSIM 0.841, FVD 18.6. Good, but the residual branch roughly doubles the parameter count for action processing. The additive injection also means action information and visual information share the same representation space, which can cause interference — the action signal may fight with existing visual features rather than cleanly augmenting them.

Mode 4: Internal (best)

Each transformer block gets a dedicated action cross-attention layer. Visual tokens attend to action features through cross-attention, the same way they attend to text embeddings in standard text-to-image diffusion. The operation is: Q = W_Q · visual_tokens, K = W_K · action_tokens, V = W_V · action_tokens. Each layer independently learns how much to attend to the action and what aspects of the action to incorporate.

Result: SSIM 0.863, FVD 14.5. The clear winner.

This is the cleanest integration: action information enters at every layer through a dedicated pathway that doesn't compete with visual self-attention. The cross-attention weights are small (action sequences are short), so the computational overhead is minimal.
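Here is a minimal single-head sketch of that cross-attention in pure Python, using tiny hand-picked matrices so the numbers are easy to follow. It uses the row-vector convention (tokens times weight matrix), omits multi-head splitting and the output projection, and the identity weights are placeholders, not learned values.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(visual, action, w_q, w_k, w_v):
    """Visual tokens (queries) attend to action tokens (keys and values)."""
    q = matmul(visual, w_q)
    k = matmul(action, w_k)
    v = matmul(action, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]              # placeholder projection weights
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 visual tokens, dim 2
action = [[2.0, 0.0], [0.0, 2.0]]                # 2 action tokens, dim 2

out, weights = cross_attention(visual, action, identity, identity, identity)
print(len(out), "output rows, one per visual token")
```

Two things are worth noticing. The softmax runs only over action tokens, so the action signal never competes with visual tokens for attention. And the output has one row per visual token, so it can be added back into the visual stream at every block.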

Why Internal wins: Cross-attention gives each transformer block direct, uncontaminated access to the action signal. Every layer can independently decide how much to attend to the action vs. the visual context. In External mode, layers 20-30 barely see the action. In Contextual mode, action tokens fight for attention with hundreds of visual tokens. Internal cross-attention gives action information a private channel at every depth.

Mode | Where action enters | SSIM ↑ | FVD ↓
External | Before transformer (modulates latents) | 0.746 | 33.4
Contextual | Alongside visual tokens (self-attn) | 0.824 | 21.2
Residual | Parallel ControlNet-style branch | 0.841 | 18.6
Internal | Dedicated cross-attention per block | 0.863 | 14.5

Why does Internal (cross-attention) injection outperform all other modes?

Chapter 6: Key Findings

Beyond the architectural choices, the paper surfaces three findings that reveal what Neural Computers can and cannot do.

Finding 1: Symbolic instability

NCCLIGen achieves 40.77 dB PSNR (frames look great) but only 0.54 character accuracy (text is often wrong). The model learns the appearance of terminal output — monospaced text in neat rows — without reliably encoding the content. This is the fundamental gap between perceptual fidelity and symbolic correctness.

For arithmetic, native accuracy is just 4%. Ask the terminal model to compute "2 + 3" and it'll render a convincing-looking answer that's usually wrong. Compare this to Sora2, which achieves 71% on the same task. The NC learns to render, not to reason.

Finding 2: Reprompting recovers accuracy

Here's the twist: when you take the NC's output frame and reprompt the model — "the answer on screen is 5, now render the terminal showing this result" — accuracy jumps from 4% to 83%. The NC is a powerful conditional renderer. It can draw whatever you tell it to draw. It just can't figure out what to draw on its own for symbolic tasks.

The implication: A Neural Computer doesn't need to be a native reasoner. It needs to be a perfect conditionable renderer. Pair it with an external reasoning engine (an LLM, a calculator, a Python interpreter) that decides what to display, and the NC renders it. This is the hybrid architecture the paper implicitly points toward.

Finding 3: Training metrics plateau

PSNR and SSIM plateau around 25,000 training steps. After that, more compute yields diminishing returns on visual quality. But character-level accuracy continues improving slowly past 60,000 steps. The model learns coarse visual structure first, then slowly refines symbolic detail — a classic curriculum effect.

The training curriculum in hindsight: Early training (0-10k steps): the model learns that terminals are dark backgrounds with light text in rows. Mid training (10k-25k): character shapes become crisp, line spacing becomes consistent, PSNR/SSIM saturate. Late training (25k-60k+): subtle improvements in character identity (distinguishing 0 from O, 1 from l, m from rn). Visual structure is easy; symbolic fidelity is hard.

The reprompting insight, in more depth

The 4% → 83% jump via reprompting deserves more attention. It means the NC's bottleneck is not rendering capacity but reasoning capacity. If you tell it the answer, it can render it. If you ask it to compute the answer, it fails. This suggests a clean division of labor in future systems: use an LLM or symbolic engine to decide what to display, and use the NC to display it. The NC becomes a universal learned display driver.
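The division of labor can be sketched as a two-stage pipeline: a symbolic engine computes the answer, then a detailed caption hands it to the renderer. Everything here is hypothetical scaffolding (the evaluate and build_render_prompt names, the prompt wording, and the echo command are mine); the NC itself is not called.

```python
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def evaluate(expr):
    """Symbolic engine: evaluate +, -, *, / on numbers without using eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def build_render_prompt(expr):
    """Detailed caption for the renderer, with the answer spelled out literally."""
    answer = evaluate(expr)
    return f"Terminal shows the command 'echo $(({expr}))' and the output line '{answer}'."

prompt = build_render_prompt("2 + 3")
print(prompt)
```

The renderer never has to know that 2 + 3 is 5. It only has to draw the characters it is told to draw, which is the task the 83% figure says it is good at.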

How does reprompting improve NCCLIGen's arithmetic accuracy from 4% to 83%?

Chapter 7: The CNC Vision

The paper doesn't stop at prototypes. Section 4 lays out a roadmap for a Completely Neural Computer (CNC) — a system that would replace a conventional computer entirely. The authors define four requirements for completeness:

(i) Turing complete

The NC must be able to simulate any computation that a Turing machine can perform. In theory this is achievable: recurrent neural networks are known to be Turing complete under idealized assumptions such as unbounded precision. Universal function approximation alone is not enough, because it says nothing about unbounded memory or iteration. In practice, the current prototypes can't reliably execute multi-step algorithms. A Turing machine needs reliable state transitions: if state A with input 1 goes to state B, it must always go to state B. Current NCs have no such guarantee.
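To see why reliable transitions matter so much, here is a small sketch. The transition table is a toy, and the 99% per-step reliability is a hypothetical number chosen for illustration, not a measurement from the paper. It also assumes errors at each step are independent.

```python
# A conventional machine: the same (state, symbol) always gives the same next state.
TRANSITIONS = {("A", 1): "B", ("B", 0): "A"}

def step(state, symbol):
    """Deterministic transition lookup."""
    return TRANSITIONS[(state, symbol)]

def run_success_probability(per_step_reliability, num_steps):
    """Chance that every step of a run is correct, assuming independent errors."""
    return per_step_reliability ** num_steps

print(step("A", 1))
for n in (10, 100, 1000):
    print(n, "steps:", f"{run_success_probability(0.99, n):.5f}")
```

Even a transition that is right 99% of the time leaves almost no chance of finishing a thousand-step program correctly. Multi-step algorithms need per-step reliability that is effectively perfect.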

(ii) Universally programmable

A user must be able to specify any task through the NC's input interface. Current NCs accept text prompts and action sequences, which is flexible but not equivalent to a programming language. You can't write a for-loop in a mouse click.

(iii) Behavior-consistent

The NC must produce the same output for the same input unless explicitly reprogrammed. Deterministic behavior is the default expectation for computers. But diffusion models are inherently stochastic — the same prompt produces different frames each time. Achieving consistency requires either deterministic sampling or learned invariance.

(iv) Architectural advantages

A CNC must offer something a conventional computer cannot. The paper identifies several candidates:

How far are we? The paper is honest: current NCs satisfy none of the four requirements fully. They can render convincing frames (partial I/O alignment), respond to short action sequences (partial control), and maintain visual coherence over short horizons (partial consistency). But reliable symbolic computation, long-horizon planning, and stable capability reuse remain open problems.

CNC Requirement | Status | Bottleneck
(i) Turing complete | Theoretical only | Can't execute multi-step algorithms reliably
(ii) Universally programmable | Partial | Text/action input is flexible but not a programming language
(iii) Behavior-consistent | Weak | Diffusion stochasticity: same input → different output
(iv) Architectural advantage | Promising | Natural language input + graceful degradation demonstrated

The path forward: The paper suggests that scaling alone won't solve (i) and (iii). Turing completeness may require hybrid architectures (NC + symbolic engine). Behavior consistency may require deterministic sampling or learned invariance losses. The most achievable near-term goal is (iv) — building NCs that offer capabilities conventional computers don't, even if they can't fully replace them.
Which CNC requirement is most fundamentally at odds with how diffusion models work?

Chapter 8: What NCs Can't Do

Intellectual honesty is critical here. The paper presents exciting prototypes, but the gap between "generates convincing terminal video" and "replaces a computer" is vast. Let's be precise about what's missing.

Reliable symbolic computation

4% arithmetic accuracy means the NC cannot perform the most basic computational task. Even with reprompting (83%), you need an external oracle to provide the correct answer. A computer that needs another computer to tell it what to compute is not yet a computer.

Long-horizon reasoning

Current video models generate 5-10 seconds of coherent video. A typical computing session lasts hours. State information in the latent representation degrades over time — the model forgets what was on screen 30 frames ago. There's no mechanism for persistent memory beyond the rolling latent window.
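The scale of the mismatch is easy to put in numbers. This sketch borrows NCGUIWorld's 15 FPS as the frame rate, which is an assumption when applied to video models in general.

```python
FPS = 15  # NCGUIWorld's frame rate, used here as a reference point

def frames_in(seconds, fps=FPS):
    """Number of frames in a span of wall-clock time."""
    return int(seconds * fps)

coherent_window = frames_in(10)      # upper end of the 5-10 second range
one_hour_session = frames_in(3600)
forgetting_horizon_s = 30 / FPS      # "30 frames ago" in seconds

print(coherent_window, one_hour_session, forgetting_horizon_s)
print(one_hour_session // coherent_window, "coherent windows per hour")
```

A one-hour session is 360 of those 10-second windows laid end to end, and information has to survive across every boundary.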

Stable capability reuse

If you teach the NC to render a text editor, that capability doesn't compose with rendering a file browser. Each skill must be independently trained. A conventional computer's composability — any program can use any other program — has no analogue in the current NC architecture.

Explicit runtime governance

You can't inspect what the NC is "thinking." The latent state h_t is a high-dimensional vector (hundreds of thousands of continuous values) with no interpretable structure. You can't set breakpoints, examine memory, or debug a Neural Computer. If it renders the wrong output, you can't trace why: which of the 885K latent values caused the error? You can only retrain or reprompt.

This is perhaps the most practically devastating limitation. Every useful computer needs a way to diagnose failures. When your real computer crashes, you get an error message, a stack trace, a core dump. When a Neural Computer produces a wrong frame, you get... a wrong frame. There's no error channel. The model doesn't know it's wrong.

The honest summary: Neural Computers are currently strong renderers and weak reasoners. They can produce visually faithful simulations of computing interfaces. They cannot perform the actual computation those interfaces represent. The gap between rendering a terminal and being a terminal is the entire field of symbolic AI.

The scaling question

Can you solve these limitations by simply making the model bigger and training on more data? The paper's evidence suggests no, at least not for symbolic computation. NCCLIGen was trained on ~1,100 hours of terminal data, yet arithmetic accuracy is 4%. The model has seen millions of arithmetic examples in terminal outputs, so the data is there. The architecture is the bottleneck: diffusion models generate pixels, not symbols. Their denoising objective rewards low reconstruction error, which tracks perceptual metrics such as PSNR and SSIM, not symbolic correctness. No amount of scaling changes what the loss function rewards.

What would need to change

An analogy to early electronics: The first electronic computers (ENIAC, 1946) could compute, but they were room-sized, unreliable, and required physical rewiring to change programs. Neural Computers in 2026 are at a similar stage: they prove the concept works, but the engineering isn't there yet. The gap between ENIAC and your laptop took 80 years. The gap between current NCs and a CNC may be shorter — or may require entirely new ideas we haven't had yet.
What is the fundamental gap between current Neural Computers and actual computers?

Chapter 9: Connections

Neural Computers sit at the intersection of several research threads. Understanding where they fit helps you see what comes next.

Diffusion Transformers (DiT)

The entire NC framework depends on DiT as the backbone. The DiT provides the state-update function F_θ: it's the "processor" of the Neural Computer. Advances in DiT efficiency (faster sampling, better latent spaces, longer context windows) directly translate to better NCs. The paper's prototypes use Wan2.1, but any sufficiently capable video DiT could serve as the substrate.

Stable Diffusion 3 and video generation

SD3's rectified flow formulation and MMDiT architecture represent the current frontier of diffusion-based generation. NCCLIGen and NCGUIWorld build on this lineage — the VAE, the text conditioning, the classifier-free guidance all trace back to the SD/SDXL/SD3 line. As video generation models improve in temporal coherence and resolution, NCs automatically inherit those gains.

World models and game simulation

NCs are closely related to world models like GameGen, Genie, and DIAMOND, which simulate game environments as video. Google's Genie (2024) showed that a single model could generate playable 2D platformer levels from a single image. DIAMOND (2024) learned to simulate Atari games with enough fidelity to train RL agents in the simulated environment.

The key difference: games have well-defined physics and reward signals. A ball accelerates downward at 9.8 m/s². A coin gives +100 points. Computing interfaces have no physics: the "rules" are the arbitrary conventions of operating system design. Why does clicking "File" open a dropdown? Because someone programmed it that way, not because of any physical law. NCs must learn these conventions purely from observation, which makes the learning problem harder.

Computer Use Agents

Claude CUA, GPT-4V browsing, and similar systems use real computers through screenshots and action APIs. NCs approach the same problem from the opposite direction: instead of an AI controlling a real computer, the AI becomes the computer. A future hybrid might use an LLM for reasoning and an NC for rendering, getting the best of both.

OneVL and video prediction as auxiliary

OneVL uses video prediction as an auxiliary training signal to improve visual understanding. NCs take the same idea to its logical extreme: video prediction is the entire system, not just an auxiliary task. Both approaches validate the thesis that learning to predict future frames builds deep understanding of visual dynamics.

System | Approach | Relationship to NCs
DiT | Diffusion backbone | The "processor" inside every NC
SD3/Wan2.1 | Image/video generation | The generation framework NCs build on
Genie/DIAMOND | Game world models | Same idea for games, NCs extend to computing
Claude CUA | AI controlling real OS | Inverse approach: AI uses computer vs. AI is computer
OneVL | Video prediction as auxiliary | NCs make video prediction the whole system

Neural Turing Machines and Differentiable Computers

Neural Computers are not the first attempt to make neural networks act as computers. Graves et al. (2014) built the Neural Turing Machine (NTM), which added external memory banks with differentiable read/write heads. The follow-up, Differentiable Neural Computer (DNC, 2016), improved memory allocation and linking. These systems could learn algorithms like sorting and copying from input-output examples.
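The "differentiable read head" idea is easiest to see in code. Here is a minimal sketch of content-based addressing as used in the NTM: compare a key against every memory row, turn the similarities into soft weights, and read a weighted blend. The memory contents and the sharpness value beta are made up for illustration, and the NTM's location-based addressing is left out.

```python
import math

def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def content_read(memory, key, beta=5.0):
    """Soft read: softmax over scaled similarities, then a weighted sum of rows."""
    sims = [beta * cosine(row, key) for row in memory]
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(memory[0])
    read = [sum(w * row[j] for w, row in zip(weights, memory)) for j in range(width)]
    return read, weights

memory = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]   # 3 memory slots, width 2
read, weights = content_read(memory, key=[1.0, 0.0])
print([round(w, 3) for w in weights])
```

Because every operation is smooth, gradients flow through the read, which is what lets the controller learn where to look. The memory here is an explicit, addressable table, in contrast to the NC's single entangled latent.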

The difference: NTMs and DNCs operate on abstract symbols in an external memory. NCs operate on pixels in a visual output space. NTMs compute; NCs render. The paper's NCs are closer to "learned display servers" than to "learned CPUs." A future synthesis might combine the symbolic capability of NTMs with the visual rendering of NCs.

The bigger picture: Neural Computers represent a philosophical boundary in AI — the point where the tool becomes the medium. If a video model can learn to simulate a computer well enough, the distinction between "running on hardware" and "being the hardware" blurs. We're not there yet, but these prototypes show the direction of travel. The paper's most lasting contribution may not be the prototypes themselves, but the formal framework (Eq 2.1) that gives future work a shared language for describing learned computing systems.
How do Neural Computers relate to Computer Use Agents like Claude CUA?