Neural Computers

Chapter 0: The Problem

Right now, every neural network in the world runs on a computer. PyTorch allocates tensors in CUDA memory. The GPU executes matrix multiplications. The operating system schedules threads and manages I/O. The neural network is a guest — it lives inside a host machine that does the real computing.

But what if you flipped that relationship? What if the neural network were the computer?

Not a metaphor. Not an analogy. A literal computer where the CPU, the RAM, the screen, the keyboard — everything — lives inside a single neural network's forward pass. You type a command, the network processes it, and it renders the next frame of what the screen should look like. No operating system. No file system. No instruction set architecture. Just a learned function that maps inputs and state to visual output.

Think about what that means. When you open a terminal on your laptop and type ls, a chain of specialized hardware and software handles the request: the keyboard controller sends a scan code, the OS kernel processes the interrupt, the shell parses the command, the file system reads the directory, and the display server renders the text. A Neural Computer does all of that in one step: it takes "the user typed ls" as input and generates the next video frame showing the directory listing. Every intermediate step is implicit in the neural network's weights.

The provocation: A conventional computer separates computation (CPU), memory (RAM/disk), and I/O (screen/keyboard) into distinct hardware. A Neural Computer collapses all three into one object: the latent state of a video model. The model's hidden activations are the memory. The diffusion process is the computation. The decoded frames are the display.

This is the core thesis of Zhuge, Zhao, Liu, Zhou et al. from Meta AI and KAUST. They don't just theorize about it — they build two working prototypes. One generates terminal sessions as video. The other generates interactive desktop environments as video, responding to mouse clicks and keyboard input in real time.

Neither prototype is a complete replacement for your laptop. Not even close. But they prove something remarkable: a video diffusion model can learn to simulate a computing interface well enough to render readable text, respond to user actions, and maintain short-term state across frames.

Why this matters beyond novelty

If you can train a model to simulate a computer, you get something conventional software can't offer: graceful degradation. A real computer crashes on invalid input. A Neural Computer produces its best approximation. A real computer can only run programs someone wrote. A Neural Computer can improvise — rendering interfaces that were never explicitly programmed, just learned from data.

The paper also reveals a deep tension: the model is an excellent renderer but a poor reasoner. It can draw a terminal that looks perfect but gets the math wrong. This gap between perception and computation is the central finding, and it shapes everything that follows.

What we'll cover: The formal framework (state update + render loop), two prototypes (terminal and desktop), four action injection architectures with quantitative comparison, findings on symbolic accuracy, data quality, and cursor supervision, and the roadmap toward a Completely Neural Computer.

In a Neural Computer, where does "memory" live?

In dedicated RAM chips on the motherboard
In the latent state of the video model: the hidden activations ARE the memory
In a separate memory module attached to the neural network

Chapter 1: The Key Insight

Think about what a conventional computer actually does at the highest level. Every clock cycle, it: (1) reads the current state from memory, (2) takes an input (instruction or I/O event), (3) computes a new state, and (4) produces an output (updated screen, network packet, etc.). That's it. That's the entire abstraction.

Now think about what a video generation model does. Every timestep, it: (1) reads the current latent representation, (2) takes a conditioning signal (text prompt, previous frame, user action), (3) produces a new latent representation via the diffusion process, and (4) decodes that latent into a visible frame.

The structural parallel is direct. State update = diffusion step. Memory = latent activations. Input = conditioning signal. Output = decoded frame. The paper's key insight is that this isn't a loose analogy: both systems fit the same formal template. Any system that maintains state, accepts input, and produces output is a state machine in the computation-theoretic sense, which is the abstraction a computer implements.

A conventional computer stores state across three separate physical substrates. The CPU registers hold immediate values. The RAM holds working data. The disk holds persistent data. Each has different access speeds, different capacities, different interfaces. This separation is an engineering decision, not a theoretical requirement.

A Neural Computer unifies all of this into a single object: the runtime latent state h_t. This is the hidden representation inside the video model at timestep t. It encodes everything the model "knows" about the current state of the simulated computer — what's on screen, what was typed before, what should happen next.

The unification in concrete terms

Let's be precise about what "unification" means with actual tensor shapes. In the paper's NCGUIWorld prototype:

Input frame x_t: 1024×768×3 RGB image (the current screen)
After VAE encoding: compressed to latent z_t of shape ~128×96×4 (spatial dimensions shrunk 8×, 4 latent channels)
Inside the DiT: these latents are patchified into ~768 visual tokens, each of dimension 1152. This sequence of tokens is h_t — the runtime state.
After DiT denoising: the tokens are un-patchified back to latent shape, and the VAE decoder produces the next 1024×768×3 frame

The "memory" isn't a metaphor. Those 768 tokens × 1152 dimensions = ~885K continuous values carry everything the model knows about the current state of the simulated computer.
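The arithmetic behind these shapes can be checked in a few lines. This is a minimal sketch, not the paper's code: the 4×4 patch size is an assumption inferred from the stated ~768 tokens (a 128×96 latent grid divided into 16-position patches), and the function name runtime_state_size is hypothetical.

```python
def runtime_state_size(width=1024, height=768, vae_stride=8, patch=4, token_dim=1152):
    """Trace frame -> latent grid -> token sequence sizes.

    patch=4 is an assumption: it is the patch size that turns a
    128x96 latent grid into the ~768 tokens quoted in the text.
    """
    latent_w = width // vae_stride          # 1024 / 8 = 128
    latent_h = height // vae_stride         # 768 / 8 = 96
    tokens = (latent_w // patch) * (latent_h // patch)
    state_values = tokens * token_dim       # size of h_t in scalars
    return latent_w, latent_h, tokens, state_values


latent_w, latent_h, tokens, state_values = runtime_state_size()
print(latent_w, latent_h, tokens, state_values)  # 128 96 768 884736
```

The last number is the ~885K figure quoted above.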

The unification has three consequences the paper explores:

Computation is continuous. Instead of discrete logic gates, the "CPU" is a differentiable function. This means you can train it end-to-end with gradient descent.
Memory is distributed. There's no memory address bus. Information is stored as patterns across all latent dimensions simultaneously, like a hologram rather than a filing cabinet.
I/O is perceptual. Input comes as pixels and text embeddings. Output comes as rendered frames. The interface is visual, not binary.

What is the key structural parallel between a video model and a computer?

Both maintain state (latent/memory), accept input (conditioning/keyboard), compute updates (diffusion/CPU), and produce output (frames/display)
Both use transistors to perform binary logic operations
Both can run Linux

Chapter 2: The Formalism

The paper formalizes a Neural Computer with two equations. These are deceptively simple, but they encode the entire framework.

h_t = F_θ(h_{t-1}, x_t, u_t)

This is the state update. At each timestep:

h_{t-1} is the previous runtime state: the latent representation that encodes everything the NC "remembers"
x_t is the current screen frame (the observation)
u_t is the user action (text prompt, mouse click, keyboard press)
F_θ is the learned state-update function — in practice, this is the Diffusion Transformer (DiT)
h_t is the new runtime state

x_{t+1} ~ G_θ(h_t)

This is the render step. The decoder (a VAE decoder in practice) takes the new latent state and produces the next visible frame. The tilde (~) means "sampled from" because the diffusion process is stochastic — there's randomness in the generation.
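The two equations compose into a simple loop. The sketch below uses toy stand-ins (a list of floats for the state, hand-written arithmetic for F_θ and G_θ); none of it is the paper's implementation, and the names state_update, render, and run are hypothetical.

```python
import random


def state_update(h_prev, x_t, u_t):
    """Toy F_theta: mix previous state, observed frame, and action."""
    return [0.5 * h + 0.3 * x + 0.2 * u for h, x, u in zip(h_prev, x_t, u_t)]


def render(h_t, rng):
    """Toy G_theta: decode the state into a frame, with sampling noise."""
    return [h + rng.gauss(0.0, 0.01) for h in h_t]


def run(actions, h0, x0, seed=0):
    """Roll the loop: state update, then render, once per action."""
    rng = random.Random(seed)
    h, x = h0, x0
    frames = []
    for u in actions:
        h = state_update(h, x, u)   # state update step
        x = render(h, rng)          # render step; the sampled frame is the next observation
        frames.append(x)
    return h, frames


h_final, frames = run(actions=[[1.0, 0.0], [0.0, 1.0]], h0=[0.0, 0.0], x0=[0.0, 0.0])
print(len(frames), len(h_final))  # 2 2
```

The point of the sketch is the data flow: the rendered frame feeds back in as the next observation, so errors in rendering can accumulate in the state.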

How this maps to actual architecture: In the paper's prototypes, F_θ is the full DiT denoising stack. The "latent state" h_t is literally the video latents z_t flowing through the transformer blocks. G_θ is the VAE decoder that converts latents back to pixel space. The action u_t enters via text encoding (T5 for prompts) or action injection modules (for mouse/keyboard). The observation x_t enters via the VAE encoder applied to the current frame.

Mapping to concrete architecture

In the Wan2.1 backbone used by NCCLIGen, the data flow is:

VAE encoder compresses the input frame x_t from pixel space (H×W×3) to latent space (h×w×c). This is lossy compression: fine details might be lost, which is one plausible reason character accuracy (54%) lags far behind what the high visual quality (40.77 dB PSNR) would suggest.
CLIP extracts a global visual embedding from x_t — a "what kind of image is this" vector that conditions the generation at a semantic level.
T5 encodes the text prompt u_t into a sequence of token embeddings. For NCCLIGen, this is the terminal command description.
Concatenation: VAE latents + CLIP features + T5 embeddings + noise ε are assembled into the input to the DiT. This concatenation is the "instruction fetch" of the Neural Computer.
DiT denoising: 30-50 iterative steps of the diffusion process. Each step refines the noisy latent toward a clean prediction. This is the "execution" phase — the bulk of computation happens here.
VAE decoder converts the denoised latent back to pixel space. This is the "display update."

Compare this to a conventional computer's fetch-decode-execute cycle:

Classical Computer	Neural Computer
Fetch instruction from memory	Encode current frame x_t + action u_t
Decode opcode	Conditioning injection into DiT
Execute ALU operation	DiT denoising (F_θ)
Write result to registers/RAM	New latent state h_t
Update display buffer	VAE decode to frame (G_θ)

The critical difference: in a classical computer, each step is a discrete, deterministic operation on bits. In a Neural Computer, the entire cycle is one differentiable computation (a sequence of denoising passes through a single neural network), operating on continuous-valued tensors.

Why video models specifically? A video model is the natural choice because it already operates as a sequential state machine. Each frame depends on all previous frames through the temporal attention in the DiT. The latent state propagates information forward in time, exactly like registers carry values between clock cycles. Image models can't do this — they generate frames independently with no temporal state.

The stochasticity problem

There's a subtle issue in the render equation x_{t+1} ~ G_θ(h_t). The tilde means the output is sampled, not deterministically computed. Run the same model with the same latent state twice, and you'll get different frames (because the diffusion process uses random noise). For a Neural Computer, this is a bug, not a feature: computers should be deterministic. The paper lists behavior consistency as one of the four requirements for a Completely Neural Computer (Chapter 7).

In practice, you can mitigate this by fixing the random seed during inference, but that's a workaround, not a solution. True determinism would require either a deterministic sampling method (like DDIM with η = 0, which adds no fresh noise at each step and is deterministic once the initial noise is fixed) or a training objective that penalizes output variance for identical inputs.
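The seed workaround can be illustrated with a toy sampler. This is a sketch under stated assumptions: sample_frame is a hypothetical stand-in for a diffusion sampler, with Python's random.Random playing the role of the noise source. Setting noise_scale to zero only loosely mirrors a sampler that injects no fresh noise.

```python
import random


def sample_frame(h_t, seed=None, noise_scale=0.05):
    """Toy stand-in for sampling a frame from the state: state plus Gaussian noise.

    With seed=None the generator is seeded from OS entropy, so repeated
    calls differ; with a fixed seed the "sampling" is reproducible.
    """
    rng = random.Random(seed)
    return [h + rng.gauss(0.0, noise_scale) for h in h_t]


h = [0.1, 0.5, 0.9]

# Fixed seed: identical frames on every run.
same = sample_frame(h, seed=42) == sample_frame(h, seed=42)

# Zero injected noise: deterministic regardless of seed.
noiseless = sample_frame(h, seed=1, noise_scale=0.0) == sample_frame(h, seed=2, noise_scale=0.0)

print(same, noiseless)  # True True
```

Fixing the seed makes a run repeatable, but the output still depends on an arbitrary number that is not part of the computer's state, which is why it counts as a workaround.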

In the formalism h_t = F_θ(h_{t-1}, x_t, u_t), what role does the DiT play?

It decodes the latent state into visible pixels
It IS the state-update function F_θ: taking previous state, current frame, and action to produce the new latent state
It encodes user actions into embedding vectors

Chapter 3: NCCLIGen — CLI as Video

The first prototype turns a text terminal into a video generation problem. You give it a text prompt describing a terminal command and an image of the first frame. The model generates the subsequent frames showing what the terminal output looks like.

This sounds simple. It is not. Terminal output is dense text rendered at 13px font size. Every character must be rendered almost pixel-perfectly or the output becomes unreadable. This makes it one of the hardest rendering tasks for a video model: there are no blurry backgrounds to hide behind.

Why terminals are the hardest test

You might think a terminal (monochrome background, fixed-width text) would be easy for a generative model. It's the opposite. In a natural image, a slightly wrong pixel is invisible. In a terminal, a few wrong pixels can turn an 'a' into an 'o', or a '1' into an 'l'. There's little redundancy: almost every pixel carries information. This is why the paper starts with terminals: if the NC can render readable terminal text, it has cleared one of the hardest bars for visual fidelity.

The evaluation reflects this: the paper uses OCR (optical character recognition) on generated frames to measure exact character accuracy, not just visual similarity. This is a much harder bar than PSNR alone.

Architecture

NCCLIGen is built on Wan2.1 I2V (image-to-video), a recent open-source video generation model. The pipeline has five stages:

VAE encodes the first terminal frame into latent space
CLIP extracts visual features from the first frame (used as a global conditioning signal)
T5 encoder processes the text prompt into token embeddings
These three signals — latent, CLIP features, text embeddings — are concatenated with diffusion noise and fed into the DiT stack
The DiT denoises iteratively, and the VAE decoder produces the output video frames

Data pipeline

The paper builds two datasets from asciinema recordings (terminal screen captures):

CLIGen(General): ~824,000 clips from the asciinema.org public gallery. About 1,100 hours of terminal video. These are messy, real-world recordings — people installing packages, editing config files, running scripts.
CLIGen(Clean): ~128,000 clips generated by running deterministic Docker scripts. Each clip has a known ground truth because the commands are scripted, not live.

A critical finding on captions: The paper tests two captioning strategies. Semantic captions describe what's happening ("installing Python packages"). Detailed captions specify the literal text content ("running pip install numpy, output shows version 1.24.0"). Detailed captions massively outperform: 26.89 dB vs 21.90 dB PSNR. Terminals need literal specification — "installing packages" is too vague to render exact text.
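To put the 4.99 dB gap in perspective, recall that PSNR is a logarithmic function of mean squared error. The sketch below uses the standard PSNR definition; the helper names psnr and mse_ratio are hypothetical, and the pixel range of 255 is an assumption about how the frames are stored.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio(psnr_high, psnr_low):
    """How many times lower the MSE is at psnr_high than at psnr_low."""
    return 10.0 ** ((psnr_high - psnr_low) / 10.0)


ratio = mse_ratio(26.89, 21.90)
print(f"detailed captions: ~{ratio:.1f}x lower MSE")
```

In other words, the detailed captions cut the mean squared pixel error by roughly a factor of three.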

Training and results

Training cost: ~15,000 H100 GPU-hours for General, ~7,000 for Clean. The key metrics at 13px font rendering:

Metric	Value	What it means
PSNR	40.77 dB	Very high — near pixel-perfect frame quality
SSIM	0.989	Structural similarity — layout and text positions are correct
Char accuracy	0.54	54% of individual characters are exactly right (OCR'd)
Line accuracy	0.31	31% of full lines are character-perfect

Notice the gap: PSNR/SSIM are excellent (the frames look right) but character accuracy is only 54%. The model renders text that is visually convincing but not always symbolically correct. This tension — visual fidelity vs. symbolic accuracy — is a recurring theme.

What 0.54 character accuracy actually looks like: Imagine a terminal line that should read numpy==1.24.0. Typical errors look like numpy==1.24.O (capital O instead of zero) or numpv==1.24.0 (v instead of y), and at 54% accuracy they are frequent: on average nearly half the characters on screen are wrong, not one per line. Each individual slip is plausible enough to fool a human glancing at the screen and wrong enough to break any program that parses it. This is the uncanny valley of text rendering.
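A character-level score like the one in the table can be sketched as follows. The exact metric definition used in the paper is not given here, so the position-wise comparison below is an assumption, and char_accuracy and line_accuracy are hypothetical names.

```python
def char_accuracy(expected, rendered):
    """Fraction of positions where the rendered character matches.

    Position-wise comparison over the longer of the two strings, so
    missing or extra characters count as errors.
    """
    length = max(len(expected), len(rendered))
    if length == 0:
        return 1.0
    matches = sum(1 for e, r in zip(expected, rendered) if e == r)
    return matches / length


def line_accuracy(expected_lines, rendered_lines):
    """Fraction of lines that are character-perfect."""
    total = max(len(expected_lines), len(rendered_lines))
    if total == 0:
        return 1.0
    perfect = sum(1 for e, r in zip(expected_lines, rendered_lines) if e == r)
    return perfect / total


truth = "numpy==1.24.0"
print(round(char_accuracy(truth, "numpy==1.24.O"), 3))  # 0.923
print(line_accuracy([truth], ["numpv==1.24.0"]))        # 0.0
```

A single wrong character in this 13-character line still scores about 92% at the character level while scoring zero at the line level, which is why line accuracy sits well below character accuracy.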

Training cost perspective

15,000 H100 GPU-hours is substantial but not extraordinary by 2026 standards. For reference, training Llama 2 70B took ~1.7M A100 GPU-hours. NCCLIGen used roughly 100× fewer GPU-hours (on faster hardware), but it's also doing a much narrower task (rendering terminal frames, not understanding language). The ~7,000 hours for CLIGen(Clean) suggests that with better data curation, training can be significantly more efficient.

Why do detailed captions dramatically outperform semantic captions for terminal rendering?

Because terminals need literal text specification: "installing packages" is too vague to render exact characters, while "pip install numpy, output shows version 1.24.0" tells the model exactly what text to draw
Because semantic captions use fewer tokens
Because the T5 encoder was pretrained on detailed text

Chapter 4: NCGUIWorld — Interactive Desktop

NCCLIGen is impressive but non-interactive — you specify the command upfront and watch the video play out. NCGUIWorld goes further: it generates a full graphical desktop environment that responds to mouse clicks and keyboard input in real time.

The target: Ubuntu 22.04 with XFCE4 desktop, 1024×768 resolution, 15 FPS. The model must learn to render windows, menus, cursors, text, and icons — and update them correctly in response to user actions.

Data sources

The paper collects training data from three sources, and the comparison between them reveals a crucial insight about data quality:

Source	Hours	Method	FVD ↓
Random Slow	~1,000	Random mouse/keyboard at slow pace	48.17
Random Fast	~400	Random mouse/keyboard at fast pace	20.37
Claude CUA	~110	Claude Computer Use Agent performing real tasks	14.72

110 hours of supervised data beats 1,400 hours of random data. Claude CUA data has an FVD of 14.72 versus 20.37 for Random Fast and 48.17 for Random Slow. The supervised set is roughly 13× smaller than the two random sets combined, yet its FVD is about 1.4× lower than Random Fast and 3.3× lower than Random Slow. Why? Because Claude performs purposeful actions: opening apps, clicking menus, typing commands. Random actions produce incoherent trajectories that teach the model little about the causal structure of a desktop environment.
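The ratios behind this comparison follow directly from the table. The snippet below only restates the table's figures; the variable names are illustrative.

```python
# Figures from the table above: (hours, FVD); lower FVD is better.
sources = {
    "Random Slow": (1000, 48.17),
    "Random Fast": (400, 20.37),
    "Claude CUA": (110, 14.72),
}

cua_hours, cua_fvd = sources["Claude CUA"]
random_hours = sources["Random Slow"][0] + sources["Random Fast"][0]

print(f"random/CUA hours: {random_hours / cua_hours:.1f}x")   # 12.7x
for name in ("Random Slow", "Random Fast"):
    hours, fvd = sources[name]
    print(f"{name}: FVD is {fvd / cua_fvd:.2f}x that of Claude CUA")
```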

Cursor supervision: a microcosm of the NC challenge

One surprisingly hard sub-problem: getting the model to render the cursor in the right position. This is a microcosm of the entire Neural Computer challenge — how do you ground abstract spatial information (coordinates) into pixel-level rendering?

The cursor is a small visual element (typically 12×19 pixels) that must appear at a precise location determined by the user's mouse input. In a real computer, this is trivial: the OS composites the cursor sprite at (x,y). In an NC, the model must learn that the action embedding "mouse at (542, 387)" means drawing a cursor at a specific pixel location. This is surprisingly difficult because the model is never told how coordinates map to pixels: the mapping depends on the screen resolution, the VAE downsampling factor, and the cursor style, and must be inferred from data.

The paper tries three progressively better supervision strategies:

Strategy	Cursor accuracy	How it works
Position-only	8.7%	Action embedding includes (x,y) coordinates
Position + Fourier	13.5%	Fourier-encode coordinates for finer spatial signal
SVG mask/reference	98.7%	Render cursor as an SVG overlay, give model a reference frame with cursor already drawn

The jump from 13.5% to 98.7% is remarkable. Spatial coordinates alone are too abstract for the model to reliably ground into pixel space. But showing the model a reference frame with the cursor already rendered at the correct position? That's a visual signal the model can copy directly. Pixel-level supervision dominates abstract supervision.
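For readers unfamiliar with the "Position + Fourier" row, the idea is to expand each coordinate into sinusoids at several frequencies. The paper's exact encoding is not specified here; the sketch below is the common sinusoidal construction, and the function name fourier_encode and the choice of four frequency bands are assumptions.

```python
import math


def fourier_encode(x, y, width=1024, height=768, num_bands=4):
    """Encode a cursor position as sin/cos features at doubling frequencies.

    Coordinates are first normalized to [0, 1]. Each band k contributes
    sin and cos of 2^k * pi * coordinate, so higher bands change faster
    and separate nearby pixels more strongly than raw (x, y) values do.
    """
    features = []
    for value, size in ((x, width), (y, height)):
        norm = value / size
        for k in range(num_bands):
            angle = (2 ** k) * math.pi * norm
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features


a = fourier_encode(542, 387)
b = fourier_encode(543, 387)   # cursor moved one pixel to the right
print(len(a))                  # 16 features: 2 coords x 4 bands x (sin, cos)
print(max(abs(p - q) for p, q in zip(a, b)))
```

Even with this richer signal the model still has to learn where those features land on screen, which is consistent with the modest gain in the table.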

Meta-action vs raw-action

Actions can be injected as raw events (mouse_move(x,y), key_press('a'), click(left)) or as meta-actions (API-like commands: type_text("hello"), click_element("OK button")). Meta-actions slightly outperform: SSIM 0.863 vs 0.847. The structured, higher-level representation is easier for the model to condition on.

Think of it this way: Raw actions are like assembly language: low-level, precise, hard to learn patterns from. Meta-actions are like a high-level API: they carry more semantic information per token. When you say click_element("OK button"), the model knows both the action (click) and the target (a button labeled OK). When you say mouse_move(542, 387) followed by click(left), the model has to figure out what's at those coordinates by itself.
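The two action vocabularies can be written down as simple data types. This is an illustrative sketch only: the class names RawAction and MetaAction, their fields, and the describe helper are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawAction:
    """Low-level input event, e.g. mouse_move / click / key_press."""
    kind: str
    x: int = 0
    y: int = 0
    key: str = ""


@dataclass(frozen=True)
class MetaAction:
    """Higher-level, API-like command that names its target."""
    name: str
    argument: str


# One user intent, two encodings.
raw_trace = [
    RawAction("mouse_move", x=542, y=387),
    RawAction("click", key="left"),
]
meta_trace = [MetaAction("click_element", "OK button")]


def describe(action):
    """Render an action as the text string that would be embedded."""
    if isinstance(action, MetaAction):
        return f'{action.name}("{action.argument}")'
    if action.kind == "mouse_move":
        return f"mouse_move({action.x}, {action.y})"
    return f"{action.kind}({action.key})"


print([describe(a) for a in raw_trace])   # ['mouse_move(542, 387)', 'click(left)']
print([describe(a) for a in meta_trace])  # ['click_element("OK button")']
```

The raw trace needs two events and carries no hint about what sits under the pointer; the meta-action names the target in a single token sequence.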

Resolution and frame rate

NCGUIWorld targets 1024×768 at 15 FPS. This is deliberately conservative: modern desktops run at 1920×1080 or higher at 60 FPS. The lower resolution reduces the latent space size (fewer visual tokens), and the lower frame rate means fewer frames to denoise per second. Even at these reduced settings, each frame requires a full diffusion inference pass. Real-time interaction requires either much faster models or clever scheduling (e.g., predicting multiple frames ahead while the user hasn't acted).
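A back-of-envelope budget makes the real-time constraint concrete. Assumptions, stated loudly: one full diffusion pass per frame, and 30 to 50 denoising steps per pass, a range borrowed from the Wan2.1 pipeline description in Chapter 2 (the step count for NCGUIWorld is not stated here). The function name per_step_budget_ms is hypothetical.

```python
def per_step_budget_ms(fps=15, denoising_steps=30):
    """Wall-clock time available per denoising step for real-time output.

    Assumes one full diffusion pass per frame and ignores VAE
    encode/decode and conditioning overhead, so it is an upper bound.
    """
    frame_budget_ms = 1000.0 / fps
    return frame_budget_ms, frame_budget_ms / denoising_steps


for steps in (30, 50):
    frame_ms, step_ms = per_step_budget_ms(denoising_steps=steps)
    print(f"{steps} steps: {frame_ms:.1f} ms/frame, {step_ms:.2f} ms/step")
```

Under these assumptions each denoising step has only one to two milliseconds, which is why faster samplers or multi-frame prediction come up as requirements.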

Why does 110 hours of Claude CUA data beat 1,400 hours of random interaction data?

Because Claude generates higher-resolution screenshots
Because purposeful actions from Claude reveal the causal structure of the desktop (menus open when clicked, text appears when typed) while random actions produce incoherent trajectories with no learnable cause-effect patterns
Because random data has too many duplicate frames

Chapter 5: Action Injection Modes

The most detailed engineering contribution in the paper is the comparison of four ways to inject user actions into the video generation backbone. This is the central design question for any interactive Neural Computer: where and how does user input enter the diffusion process?

All four modes take the same input (action features) and target the same backbone (the DiT transformer blocks). They differ in where the action signal is injected — before, alongside, around, or inside the transformer.

Mode 1: External

Action features modulate the VAE latents before they enter the transformer. Think of it as adjusting the input signal before it hits the processing pipeline. The action is encoded into a feature vector, which is added to or element-wise multiplied with the initial latent z_t. After this modulation, the transformer processes the modified latents normally — it never directly "sees" the action, only its residual effect on the input.

Result: SSIM 0.746, FVD 33.4. The worst performer. By the time action information propagates through 30+ transformer layers, it's been diluted beyond recognition. The early layers overwrite the subtle modulation with their own learned representations.

Mode 2: Contextual

Action tokens are concatenated with visual tokens in the self-attention sequence. If the visual sequence has 768 tokens and the action has 16 tokens, the self-attention now operates on 784 tokens. The transformer sees both visual patches and action tokens, and self-attention lets them interact freely. This is analogous to how text conditioning works in many text-to-image diffusion models.

Result: SSIM 0.824, FVD 21.2. Better: the action information is persistent across all layers because it's part of the token sequence. But it competes for attention bandwidth with the much larger set of visual tokens. In a 768-visual + 16-action sequence, the action tokens together make up only ~2% of the sequence (16 of 784), so under roughly uniform attention they receive about 2% of the attention budget in total.

Mode 3: Residual

ControlNet-style architecture. A parallel branch of transformer blocks processes action features separately, and their outputs are added as residual connections to the main backbone. The main backbone's weights are frozen; only the residual branch is trained. This is exactly how ControlNet adds spatial conditioning (edges, depth maps) to Stable Diffusion.

Result: SSIM 0.841, FVD 18.6. Good, but the residual branch roughly doubles the parameter count for action processing. The additive injection also means action information and visual information share the same representation space, which can cause interference — the action signal may fight with existing visual features rather than cleanly augmenting them.

Mode 4: Internal (best)

Each transformer block gets a dedicated action cross-attention layer. Visual tokens attend to action features through cross-attention, the same way they attend to text embeddings in standard text-to-image diffusion. The operation is: Q = W_Q · visual_tokens, K = W_K · action_tokens, V = W_V · action_tokens. Each layer independently learns how much to attend to the action and what aspects of the action to incorporate.

Result: SSIM 0.863, FVD 14.5. The clear winner.

This is the cleanest integration: action information enters at every layer through a dedicated pathway that doesn't compete with visual self-attention. The cross-attention score matrix is small (action sequences are short), so the computational overhead is modest.

Why Internal wins: Cross-attention gives each transformer block direct, uncontaminated access to the action signal. Every layer can independently decide how much to attend to the action vs. the visual context. In External mode, layers 20-30 barely see the action. In Contextual mode, action tokens fight for attention with hundreds of visual tokens. Internal cross-attention gives action information a private channel at every depth.
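The Internal mode's core operation can be sketched in plain Python. This is a single-head toy with random weights and tiny dimensions, not the paper's implementation: the function names, the residual add, and the scaling by the square root of the feature dimension follow standard cross-attention practice and are assumptions here.

```python
import math
import random


def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def action_cross_attention(visual, action, w_q, w_k, w_v):
    """Single-head cross-attention: visual tokens query action tokens.

    visual: [n_visual, d], action: [n_action, d], weights: [d, d].
    Returns the updated visual tokens and the attention matrix.
    """
    q = matmul(visual, w_q)                      # queries from visual tokens
    k = matmul(action, w_k)                      # keys from action tokens
    v = matmul(action, w_v)                      # values from action tokens
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])   # q times k transposed
    attn = [softmax([s * scale for s in row]) for row in scores]
    update = matmul(attn, v)
    # Residual add: visual tokens keep their content and gain the action signal.
    out = [[x + u for x, u in zip(vis_row, upd_row)]
           for vis_row, upd_row in zip(visual, update)]
    return out, attn


rng = random.Random(0)
d, n_visual, n_action = 4, 6, 2   # toy sizes; the text's example uses 768 and 16


def rand_matrix(rows, cols):
    """Random matrix with entries in [-1, 1], for illustration only."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


visual, action = rand_matrix(n_visual, d), rand_matrix(n_action, d)
out, attn = action_cross_attention(
    visual, action, rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
)
print(len(out), len(out[0]), len(attn[0]))  # 6 4 2
```

Each row of the attention matrix is a distribution over action tokens only, so no visual token can crowd the action out. At the sizes used in the text, the score matrix would be 768×16 = 12,288 entries per head, against 768×768 = 589,824 for visual self-attention.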

Mode	Where action enters	SSIM ↑	FVD ↓
External	Before transformer (modulates latents)	0.746	33.4
Contextual	Alongside visual tokens (self-attn)	0.824	21.2
Residual	Parallel ControlNet-style branch	0.841	18.6
Internal	Dedicated cross-attention per block	0.863	14.5

Why does Internal (cross-attention) injection outperform all other modes?

Because it uses more parameters than the other modes
Because dedicated cross-attention at every transformer block gives each layer direct, uncontaminated access to the action signal through a private channel, without competing for self-attention bandwidth
Because cross-attention is faster than self-attention

Chapter 6: Key Findings

Beyond the architectural choices, the paper surfaces three findings that reveal what Neural Computers can and cannot do.

Finding 1: Symbolic instability

NCCLIGen achieves 40.77 dB PSNR (frames look great) but only 0.54 character accuracy (text is often wrong). The model learns the appearance of terminal output — monospaced text in neat rows — without reliably encoding the content. This is the fundamental gap between perceptual fidelity and symbolic correctness.

For arithmetic, native accuracy is just 4%. Ask the terminal model to compute "2 + 3" and it'll render a convincing-looking answer that's usually wrong. Compare this to Sora2, which achieves 71% on the same task. The NC learns to render, not to reason.

Finding 2: Reprompting recovers accuracy

Here's the twist: when you take the NC's output frame and reprompt the model — "the answer on screen is 5, now render the terminal showing this result" — accuracy jumps from 4% to 83%. The NC is a powerful conditional renderer. It can draw whatever you tell it to draw. It just can't figure out what to draw on its own for symbolic tasks.

The implication: A Neural Computer doesn't need to be a native reasoner. It needs to be a perfect conditionable renderer. Pair it with an external reasoning engine (an LLM, a calculator, a Python interpreter) that decides what to display, and the NC renders it. This is the hybrid architecture the paper implicitly points toward.

Finding 3: Training metrics plateau

PSNR and SSIM plateau around 25,000 training steps. After that, more compute yields diminishing returns on visual quality. But character-level accuracy continues improving slowly past 60,000 steps. The model learns coarse visual structure first, then slowly refines symbolic detail: a classic coarse-to-fine learning pattern.

The training trajectory in hindsight: Early training (0-10k steps): the model learns that terminals are dark backgrounds with light text in rows. Mid training (10k-25k): character shapes become crisp, line spacing becomes consistent, PSNR/SSIM saturate. Late training (25k-60k+): subtle improvements in character identity (distinguishing 0 from O, 1 from l, m from rn). Visual structure is easy; symbolic fidelity is hard.

A deeper look at the reprompting insight

The 4% → 83% jump via reprompting deserves more attention. It means the NC's bottleneck is not rendering capacity but reasoning capacity. If you tell it the answer, it can render it. If you ask it to compute the answer, it fails. This suggests a clean division of labor in future systems: use an LLM or symbolic engine to decide what to display, and use the NC to display it. The NC becomes a universal learned display driver.
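That division of labor can be sketched as a two-stage pipeline: a symbolic engine computes the answer, then a detailed caption tells the renderer exactly what to draw. Everything here is hypothetical scaffolding: evaluate is a small arithmetic evaluator standing in for the external reasoner, and the prompt wording in build_render_prompt is illustrative, not taken from the paper.

```python
import ast
import operator

# Supported binary operators for the toy "reasoning engine".
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression):
    """Safely evaluate a small arithmetic expression without eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression, mode="eval").body)


def build_render_prompt(expression):
    """Compute the answer symbolically, then ask the renderer to draw it.

    The returned string is what would be passed to the NC as a detailed
    caption; the model call itself is out of scope for this sketch.
    """
    answer = evaluate(expression)
    return (f"terminal shows the command 'echo $(({expression}))' "
            f"and the output line '{answer}'")


print(build_render_prompt("2 + 3"))
```

The renderer never has to compute anything: the correct digits arrive in the caption, which matches the condition under which accuracy rose to 83%.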

How does reprompting improve NCCLIGen's arithmetic accuracy from 4% to 83%?

By fine-tuning the model on arithmetic examples
By explicitly telling the model what the correct answer is and asking it to render that, proving it's a strong conditional renderer even though it can't reason natively
By running the diffusion process for more denoising steps

Chapter 7: The CNC Vision

The paper doesn't stop at prototypes. Section 4 lays out a roadmap for a Completely Neural Computer (CNC) — a system that would replace a conventional computer entirely. The authors define four requirements for completeness:

(i) Turing complete

The NC must be able to simulate any computation that a Turing machine can perform. In theory this is not ruled out: recurrent neural networks are known to be Turing complete under idealized assumptions (unbounded precision and running time), although universal function approximation alone does not imply it. In practice, the current prototypes can't reliably execute multi-step algorithms. A Turing machine needs reliable state transitions: if state A with input 1 goes to state B, it must always go to state B. Current NCs have no such guarantee.
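Why reliability per transition matters so much can be seen with a small calculation. The sketch assumes step errors are independent and that one wrong transition ruins the result; both are simplifications, and the accuracy values are illustrative rather than measured.

```python
def success_probability(per_step_accuracy, num_steps):
    """Chance that every step of a multi-step computation is correct.

    Assumes independent errors and no error correction.
    """
    return per_step_accuracy ** num_steps


for p in (0.99, 0.999):
    for n in (10, 100, 1000):
        print(f"p={p}, steps={n}: {success_probability(p, n):.3f}")
```

Under these assumptions, even 99% per-step accuracy leaves only about a 37% chance of finishing a 100-step computation correctly, so "usually right" transitions are not enough for multi-step algorithms.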

(ii) Universally programmable

A user must be able to specify any task through the NC's input interface. Current NCs accept text prompts and action sequences, which is flexible but not equivalent to a programming language. You can't write a for-loop in a mouse click.

(iii) Behavior-consistent

The NC must produce the same output for the same input unless explicitly reprogrammed. Deterministic behavior is the default expectation for computers. But diffusion models are inherently stochastic — the same prompt produces different frames each time. Achieving consistency requires either deterministic sampling or learned invariance.

(iv) Architectural advantages

A CNC must offer something a conventional computer cannot. The paper identifies several candidates:

Natural language programming — describe what you want in English, skip the compiler
Graceful degradation — a conventional computer crashes on invalid input; an NC produces its best approximation
Transfer learning — train on terminal tasks, fine-tune on desktop tasks with minimal data
Continuous state: real-valued activations rather than discrete symbols (though in practice still stored at finite floating-point precision)

How far are we? The paper is honest: current NCs satisfy none of the four requirements fully. They can render convincing frames (partial I/O alignment), respond to short action sequences (partial control), and maintain visual coherence over short horizons (partial consistency). But reliable symbolic computation, long-horizon planning, and stable capability reuse remain open problems.

CNC Requirement	Status	Bottleneck
(i) Turing complete	Theoretical only	Can't execute multi-step algorithms reliably
(ii) Universally programmable	Partial	Text/action input is flexible but not a programming language
(iii) Behavior-consistent	Weak	Diffusion stochasticity — same input → different output
(iv) Architectural advantage	Promising	Natural language input + graceful degradation demonstrated

The path forward: The paper suggests that scaling alone won't solve (i) and (iii). Turing completeness may require hybrid architectures (NC + symbolic engine). Behavior consistency may require deterministic sampling or learned invariance losses. The most achievable near-term goal is (iv) — building NCs that offer capabilities conventional computers don't, even if they can't fully replace them.

Which CNC requirement is most fundamentally at odds with how diffusion models work?

Behavior-consistency: diffusion models are inherently stochastic, producing different outputs for the same input, while computers must be deterministic by default
Turing completeness
Universal programmability

Chapter 8: What NCs Can't Do

Intellectual honesty is critical here. The paper presents exciting prototypes, but the gap between "generates convincing terminal video" and "replaces a computer" is vast. Let's be precise about what's missing.

Reliable symbolic computation

4% arithmetic accuracy means the NC cannot perform the most basic computational task. Even with reprompting (83%), you need an external oracle to provide the correct answer. A computer that needs another computer to tell it what to compute is not yet a computer.

Long-horizon reasoning

Current video models generate 5-10 seconds of coherent video. A typical computing session lasts hours. State information in the latent representation degrades over time — the model forgets what was on screen 30 frames ago. There's no mechanism for persistent memory beyond the rolling latent window.
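The forgetting described above can be pictured with a fixed-length buffer. This is a deliberately crude sketch: `RollingContext` is a hypothetical class, the 30-frame window is taken from the example in the text, and real latent degradation is gradual rather than a hard cutoff.

```python
from collections import deque

WINDOW = 30  # assumed window length in frames, from the "30 frames ago" example

class RollingContext:
    """Toy stand-in for a fixed-length latent window (not the paper's code)."""

    def __init__(self, window: int = WINDOW) -> None:
        # A deque with maxlen silently drops its oldest entry when full.
        self.frames = deque(maxlen=window)

    def step(self, screen_text: str) -> None:
        self.frames.append(screen_text)

    def remembers(self, screen_text: str) -> bool:
        return screen_text in self.frames

ctx = RollingContext()
ctx.step("$ export API_KEY=abc123")  # something set early in the session
for t in range(WINDOW):
    ctx.step(f"frame {t}")           # 30 later frames push it out of the window

print(ctx.remembers("$ export API_KEY=abc123"))  # False
print(ctx.remembers("frame 29"))                 # True
```

Anything that falls out of the buffer is gone for good, which is why a session lasting hours needs some form of memory that lives outside the window.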

Stable capability reuse

If you teach the NC to render a text editor, that capability doesn't compose with rendering a file browser. Each skill must be independently trained. A conventional computer's composability — any program can use any other program — has no analogue in the current NC architecture.

Explicit runtime governance

You can't inspect what the NC is "thinking." The latent state h_t is a high-dimensional vector (hundreds of thousands of continuous values) with no interpretable structure. You can't set breakpoints, examine memory, or debug a Neural Computer. If it renders the wrong output, you can't trace why — which of the 885K latent values caused the error? You can only retrain or reprompt.

This is perhaps the most practically devastating limitation. Every useful computer needs a way to diagnose failures. When your real computer crashes, you get an error message, a stack trace, a core dump. When a Neural Computer produces a wrong frame, you get... a wrong frame. There's no error channel. The model doesn't know it's wrong.

The honest summary: Neural Computers are currently strong renderers and weak reasoners. They can produce visually faithful simulations of computing interfaces. They cannot perform the actual computation those interfaces represent. The gap between rendering a terminal and being a terminal is the entire field of symbolic AI.

The scaling question

Can you solve these limitations by simply making the model bigger and training on more data? The paper's evidence suggests no, at least not for symbolic computation. NCCLIGen was trained on ~1,100 hours of terminal data, yet arithmetic accuracy is 4%. The model has seen millions of arithmetic examples in terminal outputs, so data is not the missing ingredient. The architecture is the bottleneck: diffusion models generate pixels, not symbols. Their training objective rewards reconstructing frames, and the metrics used to judge them (PSNR, SSIM) measure visual similarity, not symbolic correctness. No amount of scaling changes what the loss function rewards.
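A small sketch shows why a pixel-level objective barely notices a wrong answer. Everything here is illustrative: the 3x5 bitmap font, the `render` function, and the use of plain pixel agreement as a crude stand-in for metrics such as PSNR are assumptions made for this example, not anything taken from the paper.

```python
from typing import List

# Tiny 3x5 bitmap font, just enough glyphs for this example.
GLYPHS = {
    "2": ["111", "001", "111", "100", "111"],
    "4": ["101", "101", "111", "001", "001"],
    "5": ["111", "100", "111", "001", "111"],
    "+": ["000", "010", "111", "010", "000"],
    "=": ["000", "111", "000", "111", "000"],
}

def render(text: str) -> List[List[int]]:
    """Rasterize text into a 5-row binary image with 1-pixel gaps between glyphs."""
    rows = []
    for r in range(5):
        line = "0".join(GLYPHS[ch][r] for ch in text)
        rows.append([int(p) for p in line])
    return rows

def pixel_agreement(a: List[List[int]], b: List[List[int]]) -> float:
    """Fraction of pixels that match between two equally sized images."""
    pairs = [(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(x == y for x, y in pairs) / len(pairs)

def symbolically_correct(text: str) -> bool:
    """Check a line of the form 'a+b=c' as arithmetic, not as pixels."""
    lhs, rhs = text.split("=")
    return sum(int(t) for t in lhs.split("+")) == int(rhs)

score = pixel_agreement(render("2+2=4"), render("2+2=5"))
print(round(score, 2))                 # 0.96: the frames are nearly identical
print(symbolically_correct("2+2=5"))   # False: the answer is simply wrong
```

Only 4 of the 95 pixels differ between the right and wrong frames, so a loss or metric defined over pixels treats the wrong answer as a near-perfect output.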

What would need to change

Memory: External memory modules (like Neural Turing Machines or memory-augmented networks) that persist beyond the latent window. Current video models have ~5-10 second context. A useful computer needs hours.
Reasoning: Integration with symbolic engines for arithmetic, logic, and planning. The NC renders; the symbolic engine computes. Like pairing a GPU (rendering) with a CPU (logic).
Composability: Modular architectures where learned "programs" can call each other. If the NC learns to render a text editor and a file browser, it should be able to compose them into "open file in editor" without retraining.
Interpretability: Latent spaces with structure that maps to human-understandable concepts. Can we find "the cursor position dimension" or "the text content dimension" in h_t? Without this, debugging a Neural Computer is impossible.
Consistency losses: Training objectives that explicitly penalize different outputs for identical inputs. Current diffusion losses only care about reconstruction quality, not determinism.
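The "Reasoning" item above (the NC renders, the symbolic engine computes) can be sketched as a simple routing step. This is a minimal illustration under stated assumptions: `compute`, `render_terminal_line`, `hybrid_step`, and the `calc` command prefix are all hypothetical names, and the "renderer" here is plain string formatting standing in for a learned video model.

```python
import ast
import operator

# Whitelisted binary operators; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def compute(expr: str):
    """Symbolic engine: evaluate +, -, *, / over numeric literals."""
    def walk(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval").body)

def render_terminal_line(command: str, result) -> str:
    """Hypothetical stand-in for the NC's renderer: here it only formats text."""
    return f"$ {command}\n{result}"

def hybrid_step(command: str) -> str:
    """Route arithmetic to the symbolic engine; the 'NC' only draws the result."""
    prefix = "calc "  # hypothetical command name for this sketch
    expr = command[len(prefix):] if command.startswith(prefix) else command
    return render_terminal_line(command, compute(expr.strip()))

print(hybrid_step("calc 17*23"))
```

The design choice is that correctness comes from the symbolic side and appearance from the rendering side, so a rendering error can make the output ugly but cannot change the computed value.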

An analogy to early electronics: The first electronic computers (ENIAC, 1946) could compute, but they were room-sized, unreliable, and required physical rewiring to change programs. Neural Computers in 2026 are at a similar stage: they prove the concept works, but the engineering isn't there yet. Getting from ENIAC to today's laptops took about 80 years. The path from current NCs to a CNC may be shorter, or it may require entirely new ideas we haven't had yet.

What is the fundamental gap between current Neural Computers and actual computers?

Current NCs are too slow to be practical
NCs are strong renderers but weak reasoners: they can visually simulate computing interfaces but cannot perform the symbolic computation those interfaces represent
NCs require too much GPU memory

Chapter 9: Connections

Neural Computers sit at the intersection of several research threads. Understanding where they fit helps you see what comes next.

Diffusion Transformers (DiT)

The entire NC framework depends on DiT as the backbone. The DiT provides the state-update function F_θ — it's the "processor" of the Neural Computer. Advances in DiT efficiency (faster sampling, better latent spaces, longer context windows) directly translate to better NCs. The paper's prototypes use Wan2.1, but any sufficiently capable video DiT could serve as the substrate.
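The role of the state-update function can be shown as a short loop. This sketch assumes the update has the general shape h_t = F(h_{t-1}, u_t) followed by frame_t = G(h_t), where u_t is the user's input at step t; the decoder name `G`, the `run_nc` helper, and the toy functions are hypothetical, and the paper's exact formulation may differ.

```python
from typing import Callable, Iterable, List, Tuple

State = Tuple[str, ...]  # toy latent state: the lines currently "on screen"

def run_nc(
    F: Callable[[State, str], State],  # state update (the DiT's role in the text)
    G: Callable[[State], str],         # decoder from state to a displayed frame
    h0: State,
    inputs: Iterable[str],
) -> List[str]:
    """Unroll h_t = F(h_{t-1}, u_t), frame_t = G(h_t) over a stream of inputs."""
    h, frames = h0, []
    for u in inputs:
        h = F(h, u)
        frames.append(G(h))
    return frames

# Hand-written stand-ins; in an NC both would be learned networks.
def toy_F(h: State, u: str) -> State:
    return (h + (f"$ {u}",))[-3:]  # keep only the last 3 lines of "screen"

def toy_G(h: State) -> str:
    return "\n".join(h)

frames = run_nc(toy_F, toy_G, (), ["pwd", "ls", "whoami", "date"])
print(frames[-1])
```

Because `run_nc` only depends on the signatures of `F` and `G`, swapping in a different backbone changes the quality of each step without changing the loop, which is the sense in which any capable video DiT could serve as the substrate.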

Stable Diffusion 3 and video generation

SD3's rectified flow formulation and MMDiT architecture represent the current frontier of diffusion-based generation. NCCLIGen and NCGUIWorld build on this lineage — the VAE, the text conditioning, the classifier-free guidance all trace back to the SD/SDXL/SD3 line. As video generation models improve in temporal coherence and resolution, NCs automatically inherit those gains.

World models and game simulation

NCs are closely related to world models like GameGen, Genie, and DIAMOND, which simulate game environments as video. Google's Genie (2024) showed that a single model could generate playable 2D platformer levels from a single image. DIAMOND (2024) learned to simulate Atari games with enough fidelity to train RL agents in the simulated environment.

The key difference: games have well-defined physics and reward signals. A falling ball accelerates at 9.8 m/s². A coin gives +100 points. Computing interfaces have no physics; their "rules" are the arbitrary conventions of operating system design. Why does clicking "File" open a dropdown? Because someone programmed it that way, not because of any physical law. NCs must learn these conventions purely from observation, which makes the learning problem harder.

Computer Use Agents

Claude CUA, GPT-4V browsing, and similar systems use real computers through screenshots and action APIs. NCs approach the same problem from the opposite direction: instead of an AI controlling a real computer, the AI becomes the computer. A future hybrid might use an LLM for reasoning and an NC for rendering, getting the best of both.

OneVL and video prediction as auxiliary

OneVL uses video prediction as an auxiliary training signal to improve visual understanding. NCs take the same idea to its logical extreme: video prediction is the entire system, not just an auxiliary task. Both approaches validate the thesis that learning to predict future frames builds deep understanding of visual dynamics.

System	Approach	Relationship to NCs
DiT	Diffusion backbone	The "processor" inside every NC
SD3/Wan2.1	Image/video generation	The generation framework NCs build on
Genie/DIAMOND	Game world models	Same idea for games, NCs extend to computing
Claude CUA	AI controlling real OS	Inverse approach — AI uses computer vs. AI is computer
OneVL	Video prediction as auxiliary	NCs make video prediction the whole system

Neural Turing Machines and Differentiable Computers

Neural Computers are not the first attempt to make neural networks act as computers. Graves et al. (2014) built the Neural Turing Machine (NTM), which added external memory banks with differentiable read/write heads. The follow-up, Differentiable Neural Computer (DNC, 2016), improved memory allocation and linking. These systems could learn algorithms like sorting and copying from input-output examples.
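A differentiable read head can be illustrated with content-based addressing: compare a query key to every memory row by cosine similarity, turn the scaled similarities into weights with a softmax, and return the weighted sum of rows. The sketch below covers only that one step in pure Python; the function names are chosen for this example, and the full NTM addressing pipeline (interpolation, shifting, sharpening) is omitted.

```python
import math
from typing import List, Sequence, Tuple

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon avoids division by zero

def content_read(
    memory: List[List[float]], key: Sequence[float], beta: float = 10.0
) -> Tuple[List[float], List[float]]:
    """Soft read: softmax over scaled cosine similarity, then a weighted sum of rows.

    beta is the key strength; larger values make the read closer to a hard lookup.
    """
    scores = [beta * cosine(row, key) for row in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(memory[0])
    read = [sum(w * row[j] for w, row in zip(weights, memory)) for j in range(width)]
    return read, weights

memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
read, weights = content_read(memory, key=[0.0, 1.0, 0.0])
print(weights.index(max(weights)))  # 1: the row most similar to the key
```

Every operation here is smooth in the key and the memory contents, which is what lets gradients flow through the read and makes the memory trainable end to end.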

The difference: NTMs and DNCs operate on abstract symbols in an external memory. NCs operate on pixels in a visual output space. NTMs compute; NCs render. The paper's NCs are closer to "learned display servers" than to "learned CPUs." A future synthesis might combine the symbolic capability of NTMs with the visual rendering of NCs.

The bigger picture: Neural Computers represent a philosophical boundary in AI — the point where the tool becomes the medium. If a video model can learn to simulate a computer well enough, the distinction between "running on hardware" and "being the hardware" blurs. We're not there yet, but these prototypes show the direction of travel. The paper's most lasting contribution may not be the prototypes themselves, but the formal framework (Eq 2.1) that gives future work a shared language for describing learned computing systems.

How do Neural Computers relate to Computer Use Agents like Claude CUA?

They are identical systems with different names
They approach the same problem from opposite directions: CUA controls a real computer through screenshots/actions, while NCs replace the computer entirely by generating the visual output directly
NCs are a subset of Computer Use Agents