Encode geometry AND PBR materials into sparse voxels, compress them 16× into ~9.6K latent tokens, then generate high-fidelity 3D assets with a 4B-parameter flow-matching model in under a minute.
You want to generate a 3D asset from a single image. Not just a blobby shape — a detailed mesh with proper materials: shiny metal here, rough wood there, translucent glass in the middle. Something a game engine can render under any lighting.
The bottleneck is representation. Before you can generate anything, you need a way to encode 3D assets that neural networks can process. And every existing choice has painful limitations.
Most large 3D generation models use signed distance functions (SDFs) or occupancy fields to represent geometry. An SDF assigns every point in 3D space a number: negative inside the surface, positive outside, zero on the surface. You extract the mesh by finding the zero-crossing.
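To make the sign convention concrete, here is a minimal sketch using a unit sphere as the shape. The function names (`sphere_sdf`, `zero_crossing`) are hypothetical, and the linear interpolation along a segment stands in for a full mesh-extraction algorithm.

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(p, center) - radius

def zero_crossing(a, b, sdf):
    """Estimate where the surface cuts segment a-b, or return None if the sign never flips."""
    da, db = sdf(a), sdf(b)
    if da * db > 0:
        return None          # both endpoints on the same side of the surface
    if da == db:
        return tuple(a)      # both endpoints lie exactly on the surface
    t = da / (da - db)       # linear interpolation of the zero of the SDF
    return tuple(x + t * (y - x) for x, y in zip(a, b))

inside = sphere_sdf((0.0, 0.0, 0.0))   # -1.0: center is one radius inside
outside = sphere_sdf((2.0, 0.0, 0.0))  # 1.0: one radius outside
hit = zero_crossing((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), sphere_sdf)  # (1.0, 0.0, 0.0)
```

Note that `zero_crossing` only works because every sample has a meaningful sign. An open sheet of fabric has no interior, so there is nothing for the sign test to detect.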
This sounds elegant, but it makes a fatal assumption: the surface must be watertight and manifold. Every point must have a clear "inside" and "outside." Real-world 3D assets violate this constantly: open surfaces, non-manifold geometry, and enclosed interior structures all break it.
Even if geometry were solved, most 3D generation models ignore materials. They generate a shape and slap on a diffuse color texture. But real assets have physically-based rendering (PBR) attributes: metallic surfaces reflect light differently than plastic; rough surfaces scatter light; glass is translucent. Without PBR, generated assets look flat and fake under novel lighting.
TRELLIS (the predecessor) tried to handle appearance by baking multiview 2D features into 3D Gaussians. But this is rendering-dependent — the appearance is view-dependent, not intrinsic to the surface. You can't re-light a Gaussian splat the way you can a mesh with PBR materials.
Interactive figure: SDFs require a clear inside/outside, and open surfaces and non-manifold geometry break that assumption. The red region shows where the SDF is undefined or produces artifacts.
TRELLIS.2's insight is deceptively simple: don't use fields at all. Instead, represent 3D assets directly as sparse voxels that store both geometry and materials — then compress those voxels into a compact latent space.
O-Voxel ("omni-voxel") is the foundation of everything. It's a sparse voxel structure on an N×N×N grid where each active voxel stores a feature tuple:
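One way to write it, assuming the tuple simply concatenates the geometry and material features tabulated in this section (the grouping and symbol order are an assumption):

$$
f_i = \big(\underbrace{v_i,\ \delta_i,\ \gamma_i}_{\text{geometry}},\ \underbrace{c_i,\ m_i,\ r_i,\ \alpha_i}_{\text{material}}\big) \quad \text{stored at coordinate } p_i
$$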
where pᵢ ∈ {0, 1, ..., N−1}³ is the voxel coordinate, and only voxels intersecting the mesh surface are active. Let's unpack each component.
For geometry, each active voxel stores three things:
| Feature | Symbol | Shape | Meaning |
|---|---|---|---|
| Dual vertex | vᵢ | [0,1]³ | A point within the voxel representing local surface position |
| Edge intersection flags | δᵢ | {0,1}³ | Which of the 3 canonical edges (X, Y, Z) the mesh intersects |
| Splitting weights | γᵢ | ℝ, > 0 | Controls how quad faces split into triangles |
The idea comes from Dual Contouring (DC), but with a crucial difference: DC requires a signed distance field to detect sign changes across edges. O-Voxel skips the field entirely. It directly tests whether each voxel edge intersects the mesh surface. If an edge crosses a triangle, the adjacent dual face is activated, and Hermite data (intersection point + normal) positions the dual vertex.
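The field-free edge test can be sketched with a standard segment-triangle intersection (Möller-Trumbore). This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, and treating the "canonical edges" as the three edges leaving the voxel's minimum corner is an assumption.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def segment_hits_triangle(p0, p1, tri, eps=1e-9):
    """Moller-Trumbore test limited to the segment p0 -> p1. Returns the hit point or None."""
    v0, v1, v2 = tri
    d = sub(p1, p0)
    e1, e2 = sub(v1, v0), sub(v2, v0)
    h = cross(d, e2)
    det = dot(e1, h)
    if abs(det) < eps:
        return None                      # segment is parallel to the triangle plane
    inv = 1.0 / det
    s = sub(p0, v0)
    u = dot(s, h) * inv                  # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(d, q) * inv                  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv                 # position along the segment
    if t < 0.0 or t > 1.0:
        return None
    return (p0[0] + t * d[0], p0[1] + t * d[1], p0[2] + t * d[2])

def edge_flags(voxel, tri):
    """Flags for the X, Y, Z edges leaving the voxel's minimum corner (assumed convention)."""
    x, y, z = voxel
    origin = (float(x), float(y), float(z))
    ends = [(x + 1.0, y, z), (x, y + 1.0, z), (x, y, z + 1.0)]
    return tuple(int(segment_hits_triangle(origin, e, tri) is not None) for e in ends)
```

No sign is ever evaluated here: the test works the same way for a closed shell and for a single floating triangle, which is the point of skipping the field.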
The dual vertex position is computed by minimizing a Quadratic Error Function (QEF):
Three terms, each serving a purpose:
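The classic Dual Contouring QEF gives a feel for what such an objective looks like. The sketch below is an assumption for illustration only, not the exact three-term objective used by O-Voxel: it fits the vertex to the Hermite planes and adds a small regularizer (weight `lam`, a hypothetical parameter) that pulls the vertex toward the mean intersection point. Clamping to the voxel-local range [0,1]³ is also assumed.

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

def dual_vertex(points, normals, lam=1e-2):
    """Minimize sum_k (n_k . (x - p_k))^2 + lam * |x - centroid|^2 in voxel-local coordinates."""
    n = len(points)
    centroid = [sum(p[k] for p in points) / n for k in range(3)]
    # Normal equations: (sum n n^T + lam I) x = sum n (n . p) + lam * centroid
    A = [[lam if i == j else 0.0 for j in range(3)] for i in range(3)]
    b = [lam * centroid[i] for i in range(3)]
    for p, nrm in zip(points, normals):
        d = sum(nrm[k] * p[k] for k in range(3))
        for i in range(3):
            b[i] += nrm[i] * d
            for j in range(3):
                A[i][j] += nrm[i] * nrm[j]
    x = solve3(A, b)
    return [min(1.0, max(0.0, v)) for v in x]   # keep the vertex inside its voxel
```

With three axis-aligned planes meeting at (0.25, 0.5, 0.75), the solver returns a point very close to that corner. The regularizer matters when the planes are nearly parallel and the plane-fit term alone is ill-conditioned.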
Each active voxel also stores PBR material attributes:
| Attribute | Symbol | Range | Meaning |
|---|---|---|---|
| Base color | cᵢ | [0,1]³ | RGB albedo (diffuse + specular base) |
| Metallic | mᵢ | [0,1] | 0 = dielectric (plastic, wood), 1 = metal |
| Roughness | rᵢ | [0,1] | 0 = mirror-smooth, 1 = fully diffuse |
| Opacity | αᵢ | [0,1] | 0 = fully transparent, 1 = opaque |
Materials are sampled by projecting each voxel center onto intersected triangles and reading from the texture map using UV coordinates. The inverse is equally simple: to reconstruct a material at any surface point, trilinear interpolation of neighboring voxel attributes gives the answer. No baking, no optimization.
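The reconstruction direction can be sketched as trilinear interpolation over a sparse dictionary of voxels. The `Material` class and `sample_material` function are hypothetical names. Placing voxel centers at coordinate + 0.5 and renormalizing the weights when some neighbors are inactive are assumptions, not details confirmed by the text.

```python
import math
from dataclasses import dataclass
from itertools import product

@dataclass
class Material:
    base_color: tuple   # RGB, each in [0, 1]
    metallic: float
    roughness: float
    opacity: float

    def as_vector(self):
        return [*self.base_color, self.metallic, self.roughness, self.opacity]

def sample_material(grid, point):
    """Trilinearly blend the materials of the active voxels surrounding `point`.

    `grid` maps integer voxel coordinates to Material. Inactive neighbors are
    skipped and the remaining weights renormalized (an assumption).
    """
    shifted = [c - 0.5 for c in point]          # voxel centers sit at coord + 0.5
    base = [math.floor(c) for c in shifted]
    frac = [c - b for c, b in zip(shifted, base)]
    total, acc = 0.0, [0.0] * 6
    for offset in product((0, 1), repeat=3):
        voxel = grid.get(tuple(b + o for b, o in zip(base, offset)))
        if voxel is None:
            continue
        w = 1.0
        for f, o in zip(frac, offset):
            w *= f if o else 1.0 - f
        total += w
        acc = [a + w * v for a, v in zip(acc, voxel.as_vector())]
    if total == 0.0:
        return None                              # no active voxel nearby
    vec = [a / total for a in acc]
    return Material(tuple(vec[:3]), vec[3], vec[4], vec[5])
```

For example, halfway between a dielectric voxel and a metal voxel the sampled `metallic` value is 0.5, and far from any active voxel the function returns `None`.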
O-Voxel gives us a native 3D representation. But a 1024³ grid with ~600K active voxels is too large for a generative model to work with directly. We need to compress it into a compact latent space — that's the job of the Sparse Compression VAE (SC-VAE).
Unlike TRELLIS (which used transformers), SC-VAE is a fully sparse-convolutional U-shaped network. It processes only active voxels, skipping the more than 99.9% of a 1024³ grid that's empty (~600K active voxels out of ~1.07 billion cells). The encoder downsamples hierarchically through residual blocks; the decoder mirrors this for reconstruction.
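A quick back-of-the-envelope check of the sparsity and of what a 16× spatial downsample means for the grid, using only the approximate figures quoted in the text:

```python
resolution = 1024
active_voxels = 600_000      # approximate count quoted in the text
factor = 16                  # SC-VAE spatial downsampling factor

total_cells = resolution ** 3
empty_fraction = 1 - active_voxels / total_cells
latent_resolution = resolution // factor      # side length of the latent grid
cells_per_latent = factor ** 3                # dense cells covered by one latent position

print(f"total cells:      {total_cells:,}")
print(f"empty fraction:   {empty_fraction:.2%}")
print(f"latent grid:      {latent_resolution}^3")
print(f"cells per latent: {cells_per_latent}")
```

The latent grid is 64 cells on a side, and each latent position has to summarize up to 4096 dense cells, almost all of them empty.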
The key innovation enabling 16× compression is Sparse Residual Autoencoding, adapted from DC-AE (originally designed for 2D images). The problem: at 16× downsampling, you're squeezing 16³ = 4096 spatial voxels into a single latent. Standard pooling destroys too much information.
The solution: before each 2× downsample, rearrange the 8 child voxels of each parent into the channel dimension:
This is a non-parametric shortcut: spatial information is preserved in channels, then averaged across groups. The learnable conv layers only need to refine a residual on top of this estimate. During upsampling, the symmetric operation distributes channels back to spatial positions:
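The two shortcuts can be sketched on a dictionary-based sparse grid. This is a minimal illustration under stated assumptions: function names are hypothetical, missing children contribute zeros, and 8 × input channels must be divisible by the output channel count. The learnable convolutions that refine the residual are omitted.

```python
from collections import defaultdict

def downsample_shortcut(voxels, out_channels):
    """Non-parametric 2x downsample: space-to-channel, then average channel groups.

    `voxels` maps (x, y, z) -> list of C floats. Missing children contribute zeros.
    """
    channels = len(next(iter(voxels.values())))
    stacked = defaultdict(lambda: [0.0] * (8 * channels))
    for (x, y, z), feat in voxels.items():
        parent = (x // 2, y // 2, z // 2)
        slot = (x % 2) * 4 + (y % 2) * 2 + (z % 2)      # which of the 8 children
        stacked[parent][slot * channels:(slot + 1) * channels] = feat
    group = 8 * channels // out_channels                 # channels averaged per output
    return {
        p: [sum(f[g * group:(g + 1) * group]) / group for g in range(out_channels)]
        for p, f in stacked.items()
    }

def upsample_shortcut(parents, out_channels):
    """Mirror operation: duplicate channels, then channel-to-space into 8 children."""
    children = {}
    for (x, y, z), feat in parents.items():
        repeat = 8 * out_channels // len(feat)
        wide = [v for v in feat for _ in range(repeat)]  # length 8 * out_channels
        for slot in range(8):
            child = (2 * x + slot // 4, 2 * y + (slot // 2) % 2, 2 * z + slot % 2)
            children[child] = wide[slot * out_channels:(slot + 1) * out_channels]
    return children
```

Two children with 2 channels each collapse into one parent with 4 channels, and the upsample produces all 8 candidate children. Deciding which of those children should actually exist is a separate step.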
Sparse convolutions are computationally expensive at high sparsity. SC-VAE replaces the standard two-conv residual block with a ConvNeXt-inspired design: one sparse conv layer followed by a wide point-wise MLP (analogous to a Transformer FFN). This doesn't change runtime but improves reconstruction quality: in the ablation, removing the optimized block raises MD by 16% and costs about 0.6 dB of PSNR.
During decoding, not all child voxels of a parent should be active. Before each upsample step, the network predicts a binary mask ρ̂ ∈ {0,1}⁸ for each parent, specifying which children to activate. Inactive children are pruned, saving both computation and memory.
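The pruning step can be sketched as follows. The function names and the child-slot ordering are assumptions, and thresholding raw logits at zero stands in for whatever decision rule the network actually uses.

```python
def logits_to_mask(logits):
    """Turn 8 per-child logits into a binary occupancy mask (threshold at 0 is assumed)."""
    return [int(value > 0.0) for value in logits]

def expand_children(parents, masks):
    """Return the coordinates of the children each parent activates.

    `parents` is an iterable of (x, y, z); `masks` maps each parent to 8 bits.
    """
    active = []
    for (x, y, z) in parents:
        for slot, keep in enumerate(masks[(x, y, z)]):
            if keep:
                active.append((2 * x + slot // 4,
                               2 * y + (slot // 2) % 2,
                               2 * z + slot % 2))
    return active

mask = logits_to_mask([2.1, -0.3, -1.0, -4.2, -0.8, -2.5, -0.1, 1.7])
children = expand_children([(0, 0, 0)], {(0, 0, 0): mask})
# children == [(0, 0, 0), (1, 1, 1)]: only 2 of the 8 candidates survive
```

Only the surviving children are passed to the next decoder stage, which is where the compute and memory savings come from.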
Interactive figure: a 1024³ voxel grid compressed through 4 stages of 2× downsampling. Each stage halves spatial resolution while doubling channels, and the residual shortcut preserves structure at each step.
SC-VAE is trained in two stages:
Let's put the compression in perspective. How does TRELLIS.2's latent space compare to every other 3D generation method?
Every 3D latent representation faces a tradeoff: more tokens give better reconstruction quality, but make generation slower and harder. The ideal method sits in the top-left corner: few tokens, high quality.
| Method | Resolution | Downsample | #Tokens | #Dims (total) | Decode Time |
|---|---|---|---|---|---|
| Dora | — | — | 2.0K | 131K | 37.7s |
| TRELLIS | — | 4× | 9.6K | 77K | 0.108s |
| Direct3D-S2 512 | 512³ | 8× | 3.0K | 48K | 1.86s |
| Direct3D-S2 1024 | 1024³ | 8× | 17K | 271K | 13.0s |
| SparseFlex 512 | 512³ | 4× | 56K | 452K | — |
| SparseFlex 1024 | 1024³ | 4× | 225K | 1799K | — |
| Ours 512 | 512³ | 16× | 2.2K | 70K | — |
| Ours 1024 | 1024³ | 16× | 9.6K | 306K | — |
At 512³, TRELLIS.2 uses only 2.2K tokens — 25× fewer than SparseFlex. At 1024³, it uses 9.6K tokens — 23× fewer than SparseFlex and nearly 2× fewer than Direct3D-S2. Yet its reconstruction quality is substantially better across every metric.
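The ratios can be checked directly from the token counts in the table (values in the table are rounded, so the ratios are approximate):

```python
tokens = {
    "SparseFlex 512": 56_000,
    "SparseFlex 1024": 225_000,
    "Direct3D-S2 1024": 17_000,
    "Ours 512": 2_200,
    "Ours 1024": 9_600,
}

ratio_sparseflex_512 = tokens["SparseFlex 512"] / tokens["Ours 512"]
ratio_sparseflex_1024 = tokens["SparseFlex 1024"] / tokens["Ours 1024"]
ratio_direct3d_1024 = tokens["Direct3D-S2 1024"] / tokens["Ours 1024"]

print(f"SparseFlex 512 / Ours 512:     {ratio_sparseflex_512:.1f}x")
print(f"SparseFlex 1024 / Ours 1024:   {ratio_sparseflex_1024:.1f}x")
print(f"Direct3D-S2 1024 / Ours 1024:  {ratio_direct3d_1024:.2f}x")
```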
Interactive bar chart: token counts and reconstruction quality (Normal PSNR) across methods at 512³ and 1024³. The ideal method sits in the bottom-right: fewer tokens, higher PSNR.
The ablation on Sketchfab assets at 256³ tells the full story:
| Setting | Downsample | #Tokens | MD ↓ | PSNR ↑ |
|---|---|---|---|---|
| SC-VAE f16c32 | 16× | 503 | 1.032 | 27.26 |
| w/o Residual AE | 16× | 503 | 1.747 (+69%) | 26.73 |
| w/o Opt. ResBlock | 16× | 503 | 1.198 (+16%) | 26.67 |
| SC-VAE f32c128 | 32× | 118 | 1.405 | 26.65 |
| w/o Residual AE (32×) | 32× | 118 | 7.394 (+426%) | 25.01 |
At 16×, residual AE keeps quality high. At 32×, even with residual AE, quality degrades. Without it, 32× is catastrophic. The 16× sweet spot delivers production-quality reconstruction with a manageable token count.
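The relative MD changes follow from the raw values in the table, with each ablated row compared against the full model at the same downsampling factor:

```python
def relative_change(ablated, baseline):
    """Percentage increase of `ablated` over `baseline`."""
    return (ablated - baseline) / baseline * 100.0

md_16x_full, md_32x_full = 1.032, 1.405

no_residual_16x = relative_change(1.747, md_16x_full)
no_opt_block_16x = relative_change(1.198, md_16x_full)
no_residual_32x = relative_change(7.394, md_32x_full)

print(f"w/o Residual AE (16x):   +{no_residual_16x:.0f}%")
print(f"w/o Opt. ResBlock (16x): +{no_opt_block_16x:.0f}%")
print(f"w/o Residual AE (32x):   +{no_residual_32x:.0f}%")
```

At 32×, removing the residual shortcut makes MD more than five times worse than the full 32× model, which is what "catastrophic" means here.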
With the latent space defined, we need a generative model to sample from it. TRELLIS.2 uses flow matching — a cleaner alternative to diffusion that learns straight-line transport from noise to data.
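A minimal sketch of the straight-line idea, on plain Python lists rather than latent tensors. The time convention (t = 0 is data, t = 1 is noise) is an assumption, and the "model" is a hand-written oracle for a dataset containing a single point, not a trained network.

```python
import random

def interpolate(x_data, noise, t):
    """Point on the straight path between data (t = 0) and noise (t = 1)."""
    return [(1 - t) * d + t * n for d, n in zip(x_data, noise)]

def target_velocity(x_data, noise):
    """Regression target for the network: constant along the straight path."""
    return [n - d for d, n in zip(x_data, noise)]

def sample(velocity_fn, noise, steps=10):
    """Euler integration from t = 1 (noise) down to t = 0 (data)."""
    x, dt = list(noise), 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-in for a trained model: the exact velocity field for a one-point dataset.
data_point = [0.3, -1.2, 2.0]

def oracle(x, t):
    return [(xi - d) / t for xi, d in zip(x, data_point)]

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in data_point]
result = sample(oracle, noise, steps=4)   # lands on data_point up to rounding
```

Because the path is a straight line, even 4 Euler steps recover the data point in this toy case. A learned velocity field is only approximately straight, so real samplers still need more steps than this.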
Generation unfolds in three sequential stages (sparse structure, then shape, then material), each with its own DiT (Diffusion Transformer):
Each DiT module uses:
The models are trained progressively:
This progressive strategy transfers learned priors across resolutions, enabling efficient training of the large sparse DiTs while maintaining fidelity.
Training a 4B-parameter 3D generation system requires careful data curation, infrastructure, and a multi-stage training protocol.
| Component | Dataset | Size | Purpose |
|---|---|---|---|
| SC-VAE | Trellis-500K (filtered) | ~500K assets | Curated from Objaverse-XL, ABO, HSSD. Only assets with PBR materials kept. |
| DiT generators | Extended collection | ~800K assets | Augmented with TexVerse for PBR diversity and realism |
| Image prompts | Rendered views | 16 views/asset | Rendered in Blender with randomized FoVs and lighting |
| Evaluation (reconstruction) | Toys4K + Sketchfab Featured | ~4090 + 90 | Unseen during training. Sketchfab: complex PBR and detailed shapes |
| Evaluation (generation) | AI-generated prompts | 100 images | Ensures train-test disjointness |
Two-stage training, as described in Chapter 3:
TRELLIS.2 is evaluated on reconstruction fidelity, generation quality, and inference speed. On the reconstruction benchmark below, the 1024³ model leads every baseline on MD, F1, and PSNR.
On the Toys4K benchmark (shape reconstruction):
| Method | #Tokens | Downsample | MD ↓ | F1 ↑ | PSNR ↑ |
|---|---|---|---|---|---|
| Dora (2K) | 2.0K | — | 366.1 | 0.019 | 22.02 |
| TRELLIS | 9.6K | 4× | 85.07 | 0.074 | 24.31 |
| Direct3D-S2 1024 | 17K | 8× | 73.17 | 0.001 | 23.82 |
| SparseFlex 1024 | 225K | 4× | 0.313 | 0.845 | 32.12 |
| Ours 512 | 2.2K | 16× | 0.032 | 0.888 | 31.00 |
| Ours 1024 | 9.6K | 16× | 0.004 | 0.971 | 35.26 |
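At 1024³, TRELLIS.2's mesh distance is roughly 78× lower than SparseFlex's (0.004 vs. 0.313 in the table above), even though SparseFlex uses 23× more tokens. Even at 512³ with only 2.2K tokens, it beats SparseFlex 1024 on MD and F1 with about 100× fewer tokens, trailing only on PSNR (31.00 vs. 32.12).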
Image-to-3D generation compared against TRELLIS, Hi3DGen, Direct3D-S2, Step1X-3D, and Hunyuan3D 2.1:
| Resolution | Shape Time | Texture Time | Total |
|---|---|---|---|
| 512³ | ~2s | ~1s | ~3s |
| 1024³ | ~10s | ~7s | ~17s |
| 1536³ | ~35s | ~25s | ~60s |
All timings on a single NVIDIA H100 GPU. These are significantly faster than existing large 3D generation models, despite producing higher quality outputs with PBR materials.
Interactive chart: inference speed across resolutions, showing how the compact latent space enables fast generation even at 1536³.
The compact latent space enables two forms of test-time scaling:
TRELLIS.2's native 3D output with PBR materials opens several practical application paths that prior methods cannot support.
The primary use case: given a single image, generate a complete 3D asset ready for rendering. The output includes a mesh with proper topology (open surfaces, enclosed interiors supported) plus full PBR materials (base color, metallic, roughness, opacity). The asset is immediately usable in game engines, film pipelines, and AR applications without post-processing.
Stage 3 (material generation) can run independently. Given any input mesh and a reference image, it produces PBR materials aligned to the geometry. This beats alternatives:
The opacity channel α in O-Voxel enables translucent surface modeling — glass, ice, thin fabric. No prior 3D generation method handles this. The generated assets can be rendered with proper alpha blending in any standard pipeline.
Because the latent space is resolution-agnostic (thanks to RoPE), a model trained at 1024³ can be applied to generate at 1536³ through cascaded inference. This requires no retraining — just re-running Stage 2 on a downsampled higher-res structure. The result: finer geometric details and sharper materials at the cost of additional inference time (~60s total for 1536³).
TRELLIS.2 sits at the intersection of 3D representation learning, latent compression, and large-scale generative modeling. Let's map where it fits.
TRELLIS introduced the Structured Latent (SLAT) representation for joint geometry-appearance modeling. But SLAT was built from multiview 2D features with rendering-based supervision, limiting its ability to capture complex structures and materials. TRELLIS.2 replaces this with native 3D data (O-Voxel), achieving 16× vs 4× compression, better quality, and proper PBR support. The generative pipeline (sparse DiT + flow matching) is evolutionary; the representation is revolutionary.
NeRFs and 3DGS represent appearance as view-dependent radiance — great for novel view synthesis, but the appearance is baked to specific lighting. O-Voxel stores intrinsic material properties (base color, metallic, roughness) that can be re-lit in any environment. The tradeoff: O-Voxel requires PBR-annotated training data, while NeRF/3DGS can learn from posed images alone.
DC-AE demonstrated that residual autoencoding enables extreme spatial compression in 2D image VAEs. TRELLIS.2 adapts this principle to sparse 3D voxels — the space-to-channel rearrangement and non-parametric shortcuts are directly inherited. The key extension: handling sparsity (most voxels are empty, children may not exist).
The generation backbone uses flow matching rather than DDPM-style diffusion. Flow matching learns straight-line transport from noise to data, enabling fewer sampling steps. The DiT architecture follows the standard recipe (AdaLN, RoPE), with the novelty being its application to structured sparse 3D latents rather than 2D image tokens.
| Aspect | TRELLIS.2 |
|---|---|
| Representation | O-Voxel (field-free sparse voxels with geometry + PBR) |
| Compression | SC-VAE, 16× spatial downsample, ~9.6K tokens @ 1024³ |
| Generator | 3 DiT models (~4B total), flow matching |
| Pipeline | Structure → Shape → Material (sequential) |
| Input | Single image (DINOv3-L features) |
| Output | Mesh + PBR textures (base color, metallic, roughness, alpha) |
| Speed (1024³) | ~17s on H100 (~10s shape + ~7s texture) |
| Speed (512³) | ~3s on H100 |
| Training data | ~800K 3D assets (Objaverse-XL, ABO, HSSD, TexVerse) |
| Training compute | 32 H100 GPUs, batch 256, AdamW lr=1e-4 |
| Key ablation | w/o Residual AE: MD +69%, PSNR -0.5dB |
| vs TRELLIS | 16× vs 4× compression, native 3D vs multiview, PBR vs baked |