TRELLIS.2 — Veanors

Chapter 0: The Problem

You want to generate a 3D asset from a single image. Not just a blobby shape — a detailed mesh with proper materials: shiny metal here, rough wood there, translucent glass in the middle. Something a game engine can render under any lighting.

The bottleneck is representation. Before you can generate anything, you need a way to encode 3D assets that neural networks can process. And every existing choice has painful limitations.

The SDF Trap

Most large 3D generation models use signed distance functions (SDFs) or occupancy fields to represent geometry. An SDF assigns every point in 3D space a number: negative inside the surface, positive outside, zero on the surface. You extract the mesh by finding the zero-crossing.

This sounds elegant, but it makes a fatal assumption: the surface must be watertight and manifold. Every point must have a clear "inside" and "outside." Real-world 3D assets violate this constantly:

Open surfaces: A leaf, a sheet of paper, a curtain. These have no "inside" — they're infinitely thin.
Non-manifold geometry: Two surfaces sharing an edge (a T-junction), or self-intersecting meshes. Common in CAD models and game assets.
Enclosed interiors: A car with a visible interior cabin, or a building you can walk into. SDFs can only represent the outermost surface.

The fundamental limitation: SDFs, Flexicubes, and all field-based representations require every surface to divide space into "inside" and "outside." Open surfaces, T-junctions, and enclosed interiors have no such division. Preprocessing these assets (closing holes, removing intersections) is lossy — you destroy the very geometry you wanted to capture.
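
To make the inside/outside requirement concrete, here is a minimal NumPy sketch. The function names and the two sample shapes (a unit sphere and an open unit disk) are illustrative assumptions, not anything from TRELLIS.2. It samples both shapes along a line that passes straight through them and counts sign flips, which is what a zero-crossing mesh extractor looks for.

```python
import numpy as np

def sphere_sdf(p, radius=1.0):
    """Signed distance to a sphere at the origin: negative inside, positive outside."""
    return np.linalg.norm(p, axis=-1) - radius

def disk_unsigned_distance(p, radius=1.0):
    """Distance to an open, infinitely thin disk in the z=0 plane (never negative)."""
    r = np.linalg.norm(p[..., :2], axis=-1)   # radial distance in the xy-plane
    dr = np.maximum(r - radius, 0.0)          # how far outside the rim
    return np.sqrt(dr**2 + p[..., 2]**2)

# Sample along the z-axis; 8 samples so that none lands exactly on a surface.
z = np.linspace(-2.0, 2.0, 8)
pts = np.stack([np.zeros_like(z), np.zeros_like(z), z], axis=-1)

sphere_vals = sphere_sdf(pts)
disk_vals = disk_unsigned_distance(pts)

# Count sign flips between consecutive samples.
sphere_flips = int(np.sum(np.sign(sphere_vals[:-1]) != np.sign(sphere_vals[1:])))
disk_flips = int(np.sum(np.sign(disk_vals[:-1]) != np.sign(disk_vals[1:])))
print(sphere_flips, disk_flips)
```

The closed sphere yields two flips (entering and leaving), while the open disk yields none even though the line passes through it, so a sign-based extractor has nothing to find.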

The Appearance Gap

Even if geometry were solved, most 3D generation models ignore materials. They generate a shape and slap on a diffuse color texture. But real assets have physically-based rendering (PBR) attributes: metallic surfaces reflect light differently than plastic; rough surfaces scatter light; glass is translucent. Without PBR, generated assets look flat and fake under novel lighting.

TRELLIS (the predecessor) tried to handle appearance by baking multiview 2D features into 3D Gaussians. But this is rendering-dependent — the appearance is view-dependent, not intrinsic to the surface. You can't re-light a Gaussian splat the way you can a mesh with PBR materials.

Full data flow at a glance:
Input: image I ∈ R^(H×W×3), encoded into DINOv3-L features.
Stage 1: predict the sparse structure layout (which voxels are active).
Stage 2: generate geometry latents within the active voxels, then decode them with the SC-VAE decoder into the O-Voxel shape (dual vertices + edge flags).
Stage 3: generate material latents conditioned on geometry, then decode them with the SC-VAE decoder into O-Voxel materials (base color + metallic + roughness + alpha).
Output: extract the mesh and PBR textures. Total: ~17s for 1024³ on an H100.

Topology Failures of SDF

[Interactive figure: open surfaces and non-manifold geometry break the SDF's inside/outside assumption. The red region marks where the SDF is undefined or produces artifacts.]

Why can't SDFs represent a sheet of paper (an open surface)?

Because SDFs require every surface to divide space into "inside" and "outside," but an infinitely thin open surface has no interior, so the sign is undefined
Because SDFs can only represent convex shapes
Because the resolution of the SDF grid is too low

Chapter 1: The Key Insight

TRELLIS.2's insight is deceptively simple: don't use fields at all. Instead, represent 3D assets directly as sparse voxels that store both geometry and materials — then compress those voxels into a compact latent space.

Step 1: Native 3D Encoding

Convert any mesh (open, non-manifold, enclosed) into O-Voxels: sparse voxels on an N×N×N grid. Each voxel stores dual vertex position + edge flags (geometry) and base color + metallic + roughness + alpha (materials). No SDF, no field. Direct mesh-to-voxel conversion in seconds.

↓

Step 2: Compress 16×

A Sparse Compression VAE with residual autoencoding downsamples the voxel grid 16× spatially. A 1024³ asset with ~600K active voxels becomes ~9.6K latent tokens. That's about half of the ~20K tokens of the original TRELLIS, and roughly 23× fewer than SparseFlex's ~225K at 1024³.

↓

Step 3: Generate in Latent Space

Three DiT-based flow-matching models (total ~4B parameters) generate structure, geometry, and materials sequentially. Image-conditioned. ~17s for a full 1024³ PBR asset on H100.

Why "field-free" matters: By skipping the SDF/occupancy field entirely, O-Voxel sidesteps all topology constraints. The mesh surface directly determines which voxels are active and where dual vertices go. No sign computation, no flood-fill, no watertight assumption. Open surfaces, T-junctions, enclosed interiors — all handled natively. The conversion is also instant: mesh → O-Voxel in seconds on CPU, O-Voxel → mesh in tens of milliseconds.

Why 16× and not 8× or 32×? Prior sparse voxel methods (TRELLIS, SparseFlex) achieve only 4× spatial downsampling. Their residual blocks can't compress further without severe quality loss. TRELLIS.2's residual autoencoding (adapted from DC-AE for 2D images) rearranges spatial information into channel dimensions before downsampling, enabling 16× compression with negligible perceptual degradation. At 32×, quality does degrade: in the 256³ ablation, MD rises from 1.032 to 1.405 and PSNR drops about 0.6dB even with residual autoencoding, and removing residual autoencoding at 32× pushes MD to 7.394 and costs a further 1.6dB. 16× is the sweet spot: compact enough for efficient generation, faithful enough for production quality.

What degrades without each component: Remove residual autoencoding → MD increases 69%, PSNR drops 0.5dB at 16×. Remove optimized ConvNeXt-style residual blocks → MD increases 16%, PSNR drops 0.6dB. Remove rendering loss in stage 2 training → material fidelity drops significantly (no perceptual supervision). Remove PBR attributes entirely → assets can't be re-lit, look flat under novel lighting.

What is the key architectural difference between TRELLIS.2 and its predecessor TRELLIS?

TRELLIS.2 encodes geometry and PBR materials directly from native 3D data (no SDF, no multiview baking), and achieves 16× spatial compression vs TRELLIS's 4×, resulting in ~9.6K tokens vs ~20K
TRELLIS.2 uses a larger transformer backbone
TRELLIS.2 generates at higher resolution

Chapter 2: O-Voxel Representation

O-Voxel ("omni-voxel") is the foundation of everything. It's a sparse voxel structure on an N×N×N grid where each active voxel stores a feature tuple:

\[ f = \{ (f_i^{\text{shape}}, f_i^{\text{mat}}, p_i) \}_{i=1}^{L} \]

where p_i ∈ {0, 1, ..., N-1}³ is the voxel coordinate, and only voxels intersecting the mesh surface are active. Let's unpack each component.

Shape: The Flexible Dual Grid

For geometry, each active voxel stores three things:

Feature	Symbol	Domain	Meaning
Dual vertex	v_i	[0,1]³	A point within the voxel representing local surface position
Edge intersection flags	δ_i	{0,1}³	Which of the 3 canonical edges (X, Y, Z) the mesh intersects
Splitting weight	γ_i	R_{>0}	Controls how quad faces split into triangles

The idea comes from Dual Contouring (DC), but with a crucial difference: DC requires a signed distance field to detect sign changes across edges. O-Voxel skips the field entirely. It directly tests whether each voxel edge intersects the mesh surface. If an edge crosses a triangle, the adjacent dual face is activated, and Hermite data (intersection point + normal) positions the dual vertex.
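
The field-free edge test can be sketched with a standard segment-triangle intersection (the Möller-Trumbore formulation). This is a brute-force illustration only: the helper name `edge_hits_triangle`, the unit voxel, and the single test triangle are assumptions, and the source describes rasterizing triangles onto the grid rather than looping like this.

```python
import numpy as np

def edge_hits_triangle(a, b, tri, eps=1e-9):
    """Return the point where segment a->b crosses triangle tri, or None."""
    v0, v1, v2 = tri
    d = b - a
    e1, e2 = v1 - v0, v2 - v0
    h = np.cross(d, e2)
    det = np.dot(e1, h)
    if abs(det) < eps:                      # segment parallel to the triangle plane
        return None
    s = a - v0
    u = np.dot(s, h) / det                  # first barycentric coordinate
    q = np.cross(s, e1)
    v = np.dot(d, q) / det                  # second barycentric coordinate
    t = np.dot(e2, q) / det                 # position along the segment, 0..1
    if u < 0 or v < 0 or u + v > 1 or t < 0 or t > 1:
        return None
    return a + t * d

# One unit voxel at the origin and one lone triangle (an open surface) in the plane x = 0.5.
corner = np.zeros(3)
tri = np.array([[0.5, -1.0, -1.0], [0.5, 2.0, -1.0], [0.5, -1.0, 2.0]])

flags, hits = [], []
for axis in range(3):                       # the 3 canonical edges: X, Y, Z
    end = corner.copy()
    end[axis] += 1.0
    hit = edge_hits_triangle(corner, end, tri)
    flags.append(int(hit is not None))
    hits.append(hit)
print(flags, hits[0])
```

Only the X edge crosses the triangle, so the flags come out as [1, 0, 0] with the crossing at (0.5, 0, 0). No sign is ever evaluated, which is why a single unclosed triangle poses no problem.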

The dual vertex position is computed by minimizing a Quadratic Error Function (QEF):

\[ \min_{v \in \text{voxel}} e(v) = \sum_i d_{\Pi,i}^2 + \lambda_{\text{bound}} \sum_j d_{L,j}^2 + \lambda_{\text{reg}} \, d_{\hat{q}}^2 \]

Three terms, each serving a purpose:

Plane distance d²_Π: The original DC objective — place v close to the tangent planes defined by intersection points and normals. This captures smooth surfaces.
Boundary distance d²_L: New in O-Voxel — penalizes distance from v to boundary edges of the mesh. This handles open surfaces where there's no sign change to guide placement.
Regularization d²_q̂: Keeps v near the average of intersection points. Stabilizes the QEF against degenerate configurations.

Mesh → O-Voxel (instant, CPU-only): (1) Rasterize mesh triangles onto the N³ grid to find all voxel edges that intersect the surface. (2) Mark neighboring voxels as active. (3) Compute Hermite data (intersection points + normals) analytically from triangle geometry. (4) Solve QEF (Eq. above) in closed form for each active voxel. Total time: a few seconds on a single CPU for 1024³. No optimization loop, no GPU needed.
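
As a rough sketch of the closed-form QEF solve, the three squared-distance terms are all quadratic in v, so the minimizer follows from one 3×3 linear system. The function name `solve_qef`, the λ values, and the final clipping to the unit voxel are assumptions made for illustration; the source does not spell out how the constraint v ∈ voxel is enforced.

```python
import numpy as np

def solve_qef(points, normals, boundary_lines=(), lam_bound=1.0, lam_reg=1e-2):
    """Minimize sum_i (n_i . (v - p_i))^2 + lam_bound * sum_j dist(v, line_j)^2
    + lam_reg * |v - q_hat|^2, where q_hat is the mean intersection point."""
    points = np.asarray(points, dtype=float)
    normals = np.asarray(normals, dtype=float)
    q_hat = points.mean(axis=0)
    A = normals.T @ normals                                   # plane term: sum n n^T
    b = normals.T @ np.einsum("ij,ij->i", normals, points)    # sum n (n . p)
    for a, u in boundary_lines:                               # line through a with direction u
        a = np.asarray(a, dtype=float)
        u = np.asarray(u, dtype=float)
        u = u / np.linalg.norm(u)
        P = np.eye(3) - np.outer(u, u)                        # removes the component along the line
        A += lam_bound * P
        b += lam_bound * (P @ a)
    A += lam_reg * np.eye(3)                                  # regularizer keeps A invertible
    b += lam_reg * q_hat
    v = np.linalg.solve(A, b)
    return np.clip(v, 0.0, 1.0)                               # assumed: clamp into the unit voxel

# Three axis-aligned planes meeting at a corner near (0.3, 0.6, 0.5), in voxel-local coordinates.
pts = [[0.3, 0.0, 0.0], [0.0, 0.6, 0.0], [0.0, 0.0, 0.5]]
nrm = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
v = solve_qef(pts, nrm)
print(v)
```

With a small regularization weight the solution lands close to the sharp corner, pulled only slightly toward the mean of the intersection points.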

Material: Volumetric Surface Attributes

Each active voxel also stores PBR material attributes:

f^mat_i = (c_i, m_i, r_i, α_i)

Attribute	Symbol	Range	Meaning
Base color	c_i	[0,1]³	RGB albedo (diffuse + specular base)
Metallic	m_i	[0,1]	0 = dielectric (plastic, wood), 1 = metal
Roughness	r_i	[0,1]	0 = mirror-smooth, 1 = fully diffuse
Opacity	α_i	[0,1]	0 = fully transparent, 1 = opaque

Materials are sampled by projecting each voxel center onto intersected triangles and reading from the texture map using UV coordinates. The inverse is equally simple: to reconstruct a material at any surface point, trilinear interpolation of neighboring voxel attributes gives the answer. No baking, no optimization.
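
The trilinear lookup can be illustrated on a single cell whose 8 corners carry the 6 material channels. This is a dense toy sketch: the function name `trilinear` and the channel order (base color, metallic, roughness, opacity) are assumptions, and a sparse grid would also need a rule for missing neighbors, which the source does not describe.

```python
import numpy as np

def trilinear(corner_values, frac):
    """Interpolate attributes stored at the 8 corners of one cell.
    corner_values: array (2, 2, 2, C) indexed [x, y, z]; frac: (fx, fy, fz) in [0, 1]."""
    fx, fy, fz = frac
    c = corner_values
    c = c[0] * (1 - fx) + c[1] * fx       # collapse x -> (2, 2, C)
    c = c[0] * (1 - fy) + c[1] * fy       # collapse y -> (2, C)
    return c[0] * (1 - fz) + c[1] * fz    # collapse z -> (C,)

# Assumed channel layout: [R, G, B, metallic, roughness, opacity].
cell = np.zeros((2, 2, 2, 6))
cell[..., 3] = 1.0                        # metallic everywhere
cell[1, :, :, 4] = 1.0                    # roughness ramps from 0 at x=0 to 1 at x=1
out = trilinear(cell, (0.25, 0.5, 0.5))
print(out)
```

A query a quarter of the way along x returns roughness 0.25 and metallic 1.0, with no texture baking or optimization involved.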

Per-voxel feature dimensionality: Shape features: 3 (dual vertex) + 3 (edge flags) + 1 (splitting weight) = 7. Material features: 3 (base color) + 1 (metallic) + 1 (roughness) + 1 (opacity) = 6. Total: 13 channels per active voxel. For a 1024³ asset with ~600K active voxels, that's ~600K × 13 ≈ 7.8M floats before compression. After 16× SC-VAE compression: ~9.6K tokens × 32 channels = ~307K floats. A 25× reduction in raw data.

What is the key advantage of O-Voxel's "field-free" design over traditional Dual Contouring?

It uses a higher-resolution grid
It directly tests mesh-edge intersections instead of requiring a signed distance field, so it handles open, non-manifold, and enclosed surfaces without any topology constraints
It stores more features per voxel

Chapter 3: Sparse Compression VAE

O-Voxel gives us a native 3D representation. But a 1024³ grid with ~600K active voxels is too large for a generative model to work with directly. We need to compress it into a compact latent space — that's the job of the Sparse Compression VAE (SC-VAE).

Architecture: Fully Sparse-Convolutional

Unlike TRELLIS (which used transformers), SC-VAE is a fully sparse-convolutional U-shaped network. It processes only active voxels, skipping the ~99.94% of the grid that's empty (about 600K active voxels out of 1024³ ≈ 1.07 billion cells). The encoder downsamples hierarchically through residual blocks; the decoder mirrors this for reconstruction.

The Residual Autoencoding Trick

The key innovation enabling 16× compression is Sparse Residual Autoencoding, adapted from DC-AE (originally designed for 2D images). The problem: at 16× downsampling, you're squeezing 16³ = 4096 spatial voxels into a single latent. Standard pooling destroys too much information.

The solution: before each 2× downsample, rearrange the 8 child voxels of each parent into the channel dimension:

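\[ F_{\text{coarse}}^{\text{raw}} = \mathrm{stack}(F_{\text{child}_1}, \ldots, F_{\text{child}_8}) \in \mathbb{R}^{8C} \]

\[ F_{\text{coarse}} = \mathrm{avg\_groups}(F_{\text{coarse}}^{\text{raw}}) \in \mathbb{R}^{C'} \]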

This is a non-parametric shortcut: spatial information is preserved in channels, then averaged across groups. The learnable conv layers only need to refine a residual on top of this estimate. During upsampling, the symmetric operation distributes channels back to spatial positions:

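\[ F_{\text{fine}}^{\text{raw}} = \mathrm{unstack}(F_{\text{coarse}}) \in \mathbb{R}^{8 \times C'/8} \]

\[ F_{\text{fine}} = \mathrm{dup\_groups}(F_{\text{fine}}^{\text{raw}}) \in \mathbb{R}^{C} \]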

Why this works: Without residual shortcuts, the convolution layers must learn to both compress spatial structure AND preserve detail. That's too much to ask at 16×. The residual shortcut handles coarse structure preservation for free (non-parametric averaging), letting the convolutions focus on fine detail refinement. Ablation: removing residual AE at 16× causes MD to increase 69% and PSNR to drop 0.5dB. At 32× the shortcut matters even more: without it, MD jumps from 1.405 to 7.394 and PSNR falls 1.6dB. Even with it, 32× is worse than 16× (MD 1.405 vs 1.032), so the shortcut cannot fully offset that much compression.
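
The non-parametric shortcut can be sketched on a small dense grid. This is a toy under stated assumptions: the real SC-VAE operates on sparse voxels where some children may not exist, and the function names, the consecutive channel grouping, and the tensor layout (D, H, W, C) are illustrative choices, not the authors' code.

```python
import numpy as np

def down_shortcut(x, c_out):
    """x: dense (D, H, W, C) grid with even D, H, W. Returns (D/2, H/2, W/2, c_out)."""
    D, H, W, C = x.shape
    assert (8 * C) % c_out == 0
    # Space-to-channel: move the 8 children of each parent into the channel axis (8C).
    x = x.reshape(D // 2, 2, H // 2, 2, W // 2, 2, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(D // 2, H // 2, W // 2, 8 * C)
    # Group averaging: split the 8C channels into c_out groups and average each group.
    return x.reshape(D // 2, H // 2, W // 2, c_out, (8 * C) // c_out).mean(axis=-1)

def up_shortcut(x, c_out):
    """x: (D, H, W, C'). Returns (2D, 2H, 2W, c_out): channel-to-space, then duplication."""
    D, H, W, Cp = x.shape
    assert Cp % 8 == 0 and c_out % (Cp // 8) == 0
    x = x.reshape(D, H, W, 2, 2, 2, Cp // 8)                  # 8 children, C'/8 channels each
    x = x.transpose(0, 3, 1, 4, 2, 5, 6).reshape(2 * D, 2 * H, 2 * W, Cp // 8)
    return np.repeat(x, c_out // (Cp // 8), axis=-1)          # duplicate channels up to c_out

rng = np.random.default_rng(0)
fine = rng.standard_normal((4, 4, 4, 16))
coarse = down_shortcut(fine, c_out=32)      # 8 * 16 = 128 channels averaged into 32
restored = up_shortcut(coarse, c_out=16)    # 32 / 8 = 4 channels per child, duplicated to 16
print(coarse.shape, restored.shape)
```

Neither function has learnable weights, so in a full block the convolutional path would be added on top of this output and only has to model the residual.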

Optimized ConvNeXt-style Blocks

Sparse convolutions are computationally expensive at high sparsity. SC-VAE replaces the standard two-conv residual block with a ConvNeXt-inspired design: one sparse conv layer followed by a wide point-wise MLP (analogous to a Transformer FFN). This doesn't change runtime but improves reconstruction quality: in the ablation, reverting to the standard block raises MD by 16% and costs 0.6dB PSNR.

Early-Pruning Upsampler

During decoding, not all child voxels of a parent should be active. Before each upsample step, the network predicts a binary mask ρ̂ ∈ {0,1}⁸ for each parent, specifying which children to activate. Inactive children are pruned, saving both computation and memory.
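
A minimal sketch of the pruning step, assuming integer voxel coordinates and a fixed ordering of the 8 child offsets (both are illustrative assumptions). In the real decoder the mask comes from the network; here it is hard-coded.

```python
import numpy as np

# The 8 child offsets of a parent voxel, in an assumed fixed order.
CHILD_OFFSETS = np.array([[dx, dy, dz] for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)])

def upsample_with_pruning(parent_coords, child_mask):
    """parent_coords: (P, 3) ints; child_mask: (P, 8) booleans, one row per parent.
    Returns the (K, 3) coordinates of the children that survive pruning."""
    children = parent_coords[:, None, :] * 2 + CHILD_OFFSETS[None, :, :]   # (P, 8, 3)
    return children[child_mask]                                            # drop masked-out children

parents = np.array([[0, 0, 0], [1, 2, 3]])
mask = np.zeros((2, 8), dtype=bool)
mask[0, [0, 7]] = True        # first parent keeps children at offsets (0,0,0) and (1,1,1)
mask[1, 1] = True             # second parent keeps the child at offset (0,0,1)
kept = upsample_with_pruning(parents, mask)
print(kept)
```

Two parents could spawn 16 children, but only 3 are materialized, which is where the compute and memory savings come from.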

Concrete compression numbers: Input: 1024³ O-Voxel with ~600K active voxels × 13 channels. After 16× spatial downsampling (4 stages of 2×): ~9.6K latent tokens × 32 channels = ~307K floats. Spatial resolution shrinks from 1024³ to 64³. Compare: TRELLIS at 4× downsampling produces ~20K tokens. Direct3D-S2 at 8× produces ~17K tokens at 1024³. SparseFlex at 4× produces 225K tokens at 1024³. TRELLIS.2 is 23× more compact than SparseFlex.

Compression Pipeline

[Interactive figure: a 1024³ voxel grid compressed through 4 stages of 2× downsampling (1024³ → 512³ → 256³ → 128³ → 64³). Each stage halves spatial resolution while doubling channels, and the residual shortcut preserves structure at each step.]

Two-Stage Training

SC-VAE is trained in two stages:

Stage 1 (low-resolution): Direct O-Voxel reconstruction losses. MSE on dual vertex positions, BCE on edge flags and pruning masks, L1 on materials, plus KL divergence.
Stage 2 (high-resolution): Add rendering-based perceptual supervision. Render mask, depth, normal, and material maps from random camera positions. Supervise with L1 + SSIM + LPIPS. Cameras are placed with shallow near planes to slice through the surface, forcing the model to capture internal structures too.

Decoupled latent spaces: Shape and material are encoded by separate SC-VAEs. The shape VAE encodes geometry features alone. The material VAE encodes material features, conditioned on the shape VAE's subdivision structures during upsampling. This decoupling enables sequential generation: generate shape first, then generate materials conditioned on that shape. It also allows texture-only generation for given shapes.

What is the purpose of the Sparse Residual Autoencoding shortcut?

It rearranges spatial information into channels before downsampling, preserving structure non-parametrically so the conv layers only need to learn a residual refinement, enabling 16× compression without catastrophic quality loss
It reduces the number of parameters in the VAE
It speeds up inference by skipping empty voxels

Chapter 4: Latent Space (Showcase)

Let's put the compression in perspective. How does TRELLIS.2's latent space compare to every other 3D generation method?

The Token Efficiency Frontier

Every 3D latent representation faces a tradeoff: more tokens give better reconstruction quality, but make generation slower and harder. The ideal method sits in the top-left corner: few tokens, high quality.

Method	Resolution	Downsample	#Tokens	#Dims (total)	Decode Time
Dora	—	—	2.0K	131K	37.7s
TRELLIS	—	4×	9.6K	77K	0.108s
Direct3D-S2 512	512³	8×	3.0K	48K	1.86s
Direct3D-S2 1024	1024³	8×	17K	271K	13.0s
SparseFlex 512	512³	4×	56K	452K	—
SparseFlex 1024	1024³	4×	225K	1799K	—
Ours 512	512³	16×	2.2K	70K	—
Ours 1024	1024³	16×	9.6K	306K	—

At 512³, TRELLIS.2 uses only 2.2K tokens — 25× fewer than SparseFlex. At 1024³, it uses 9.6K tokens — 23× fewer than SparseFlex and nearly 2× fewer than Direct3D-S2. Yet its reconstruction quality is substantially better across every metric.
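
The ratios quoted above follow directly from the token counts in the table. The short check below uses only those rounded counts, so the results are approximate.

```python
# Token counts copied from the table above (rounded, in tokens).
tokens = {
    "SparseFlex 512": 56_000,
    "SparseFlex 1024": 225_000,
    "Direct3D-S2 1024": 17_000,
    "Ours 512": 2_200,
    "Ours 1024": 9_600,
}

ratio_sparseflex_512 = tokens["SparseFlex 512"] / tokens["Ours 512"]
ratio_sparseflex_1024 = tokens["SparseFlex 1024"] / tokens["Ours 1024"]
ratio_direct3d_1024 = tokens["Direct3D-S2 1024"] / tokens["Ours 1024"]

print(f"{ratio_sparseflex_512:.1f}x, {ratio_sparseflex_1024:.1f}x, {ratio_direct3d_1024:.2f}x")
```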

Why fewer tokens AND better quality? The answer is O-Voxel + Residual Autoencoding. O-Voxel stores geometry and materials directly (no redundant field evaluation), so the raw data is already compact. The residual shortcut preserves most of this information through aggressive downsampling. Other methods start from less efficient representations (full grids, point clouds) and use weaker compression, so they need more tokens to achieve less.

Token Efficiency Comparison

[Interactive bar chart: token counts and reconstruction quality (Normal PSNR) per method, at 512³ and 1024³. Fewer tokens and higher PSNR are better.]

What Happens at Different Compression Ratios

The ablation on Sketchfab assets at 256³ tells the full story:

Setting	Downsample	#Tokens	MD ↓	PSNR ↑
SC-VAE f16c32	16×	503	1.032	27.26
w/o Residual AE	16×	503	1.747 (+69%)	26.73
w/o Opt. ResBlock	16×	503	1.198 (+16%)	26.67
SC-VAE f32c128	32×	118	1.405	26.65
w/o Residual AE (32×)	32×	118	7.394 (+526%)	25.01

At 16×, residual AE keeps quality high. At 32×, even with residual AE, quality degrades. Without it, 32× is catastrophic. The 16× sweet spot delivers production-quality reconstruction with a manageable token count.

At 1024³ resolution, how does TRELLIS.2's token count compare to SparseFlex?

About the same number of tokens
TRELLIS.2 uses ~9.6K tokens vs SparseFlex's ~225K (roughly 23× fewer) while achieving better reconstruction quality on every metric
SparseFlex uses fewer tokens because it has higher compression

Chapter 5: Flow Matching for 3D

With the latent space defined, we need a generative model to sample from it. TRELLIS.2 uses flow matching — a cleaner alternative to diffusion that learns straight-line transport from noise to data.
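
A minimal sketch of the straight-line idea, assuming the common convention that t=0 is noise and t=1 is data (the source does not state which convention TRELLIS.2 uses). Instead of a trained network, the toy uses a case where the ideal velocity is known in closed form: every data sample equals a fixed point `mu`. All names are illustrative.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def flow_matching_loss(velocity_fn, x0, x1, t):
    """MSE between the predicted velocity and the line's constant velocity x1 - x0."""
    xt = interpolate(x0, x1, t)
    return np.mean((velocity_fn(xt, t) - (x1 - x0)) ** 2)

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x, dt = x0.copy(), 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

mu = np.array([1.0, -2.0, 0.5])

def ideal_velocity(x, t):
    """Exact velocity field when all data equals mu: points straight at mu."""
    return (mu - x) / (1.0 - t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 3))                      # 4 noise samples in 3 dimensions
x1 = np.broadcast_to(mu, x0.shape)
loss = flow_matching_loss(ideal_velocity, x0, x1, t=0.3)
samples = euler_sample(ideal_velocity, x0, steps=10)
print(loss, samples[0])
```

Because the paths are straight, the ideal field has zero loss and even a coarse Euler integration lands on the data point, which is the intuition behind needing fewer sampling steps. A trained DiT only approximates this field.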

Three-Stage Generation Pipeline

Generation unfolds in three sequential stages, each with its own DiT (Diffusion Transformer):

Stage 1: Sparse Structure

Predict which voxels are active (the occupancy layout). Input: image features from DINOv3-L. Output: binary occupancy grid at 64³ (the latent resolution). This is a classification problem — a lightweight model.

↓

Stage 2: Geometry Generation

Given the active voxel layout, generate geometry latents within active voxels. A sparse DiT takes the layout + image features and produces shape latent tokens via flow matching. SC-VAE shape decoder converts to O-Voxel geometry (dual vertices + edge flags).

↓

Stage 3: Material Generation

Given the generated geometry, generate material latents. A sparse DiT conditioned on image features + geometry latents produces material tokens. SC-VAE material decoder converts to PBR attributes (base color, metallic, roughness, alpha). This is the novel stage — native 3D PBR generation.

DiT Architecture Details

Each DiT module uses:

AdaLN-single modulation: Adaptive Layer Normalization with a single timestep+condition embedding modulating all layers. More parameter-efficient than per-layer conditioning.
Rotary Position Embedding (RoPE): 3D positional encoding that generalizes across resolutions. Crucial for the cascaded inference that scales from 512³ to 1024³ to 1536³.
Image conditioning: DINOv3-L features extracted from the input image, injected via cross-attention.
Classifier-free guidance: 10% condition drop rate during training. At inference, guidance scale controls fidelity-diversity tradeoff.
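
Classifier-free guidance has a training side and an inference side, sketched below. The 10% drop rate comes from the text; the null embedding, the function names, and the specific guidance formula (the commonly used linear extrapolation) are assumptions, since the source does not give the exact form.

```python
import numpy as np

def drop_condition(cond, null_cond, drop_prob, rng):
    """Training: swap each sample's condition for a null embedding with probability drop_prob."""
    keep = rng.random(cond.shape[0]) >= drop_prob
    return np.where(keep[:, None], cond, null_cond)

def guided_velocity(v_cond, v_uncond, scale):
    """Inference: extrapolate from the unconditional prediction toward the conditional one."""
    return v_uncond + scale * (v_cond - v_uncond)

rng = np.random.default_rng(0)
cond = np.ones((10_000, 4))                  # stand-in for image features
null_cond = np.zeros(4)                      # assumed null embedding
dropped = drop_condition(cond, null_cond, drop_prob=0.10, rng=rng)
drop_fraction = 1.0 - dropped[:, 0].mean()

v_c = np.array([1.0, 2.0])
v_u = np.array([0.0, 1.0])
v_guided = guided_velocity(v_c, v_u, scale=3.0)
print(drop_fraction, v_guided)
```

A scale of 1 reproduces the conditional prediction and a scale of 0 the unconditional one; larger scales trade diversity for fidelity to the input image.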

Architecture scaling: Each DiT has ~1.3B parameters (width: 1536, blocks: 30, heads: 12, MLP width: 8192). Three DiTs total ~3.9B parameters, plus the SC-VAEs and conditioning encoder, reaching ~4B total. Critically, the compact latent space (9.6K tokens at 1024³) means these DiTs process far fewer tokens than if the latent space were larger. A vanilla DiT architecture suffices — no need for the convolutional packing or skip connections that TRELLIS required to handle its 20K tokens efficiently.
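
The ~1.3B figure can be sanity-checked from the stated width, depth, and MLP width. The estimate below assumes each block holds one self-attention layer, one cross-attention layer, and one MLP, and it counts weight matrices only (no biases, norms, modulation, or embeddings). That block layout is an assumption for this back-of-the-envelope check, not a statement of the actual architecture.

```python
def dit_block_params(width, mlp_width):
    """Weight-matrix parameters in one block, ASSUMING self-attn + cross-attn + MLP."""
    self_attn = 4 * width * width       # Q, K, V, and output projections
    cross_attn = 4 * width * width      # same shapes, assuming conditions are projected to `width`
    mlp = 2 * width * mlp_width         # up- and down-projection
    return self_attn + cross_attn + mlp

per_model = 30 * dit_block_params(width=1536, mlp_width=8192)
total = 3 * per_model
print(f"{per_model / 1e9:.2f}B per DiT, {total / 1e9:.2f}B for three")
```

Under these assumptions each DiT comes to about 1.32B parameters and the three together to about 3.96B, in line with the ~1.3B and ~3.9B quoted above.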

Why separate shape and texture? Two reasons. First, shape structure must be known before materials can be spatially aligned to it. Second, the decoupled pipeline enables shape-conditioned texture generation as a standalone tool: given any 3D mesh and a reference image, Stage 3 alone can synthesize PBR materials. This is impossible with joint generation.

Progressive Resolution Training

The models are trained progressively:

Stage 1 (sparse structure) trains with 512×512 conditioning images to learn coarse occupancy priors.
Stages 2-3 start at 512³ output (32³ latent) and scale to 1024³ output (64³ latent), with conditioning images increasing to 1024×1024.

This progressive strategy transfers learned priors across resolutions, enabling efficient training of the large sparse DiTs while maintaining fidelity.

Why does TRELLIS.2 generate shape and materials in separate stages instead of jointly?

Because shape structure must be known for materials to be spatially aligned, and the decoupled design enables standalone texture generation for arbitrary input meshes
Because joint generation requires too much GPU memory
Because the VAE can only encode one modality at a time

Chapter 6: Training

Training a 4B-parameter 3D generation system requires careful data curation, infrastructure, and a multi-stage training protocol.

Training Data

Component	Dataset	Size	Purpose
SC-VAE	Trellis-500K (filtered)	~500K assets	Curated from Objaverse-XL, ABO, HSSD. Only assets with PBR materials kept.
DiT generators	Extended collection	~800K assets	Augmented with TexVerse for PBR diversity and realism
Image prompts	Rendered views	16 views/asset	Rendered in Blender with randomized FoVs and lighting
Evaluation (reconstruction)	Toys4K + Sketchfab Featured	~4090 + 90	Unseen during training. Sketchfab: complex PBR and detailed shapes
Evaluation (generation)	AI-generated prompts	100 images	Ensures train-test disjointness

Training Infrastructure

SC-VAE training: 16 H100 GPUs, batch size 128. Uses optimized Triton implementation for submanifold convolutions.
DiT training: 32 H100 GPUs, batch size 256. AdamW optimizer (lr = 1e-4, weight decay 0.01). Classifier-free guidance with 10% condition drop rate.

Data preprocessing pipeline: For each 3D asset: (1) Convert mesh to O-Voxel at target resolution (seconds on CPU). (2) Run SC-VAE encoder to get latent tokens. (3) Render 16 views in Blender with randomized camera FoVs and HDR lighting. (4) Extract DINOv3-L features from rendered views. The O-Voxel conversion is the key enabler — it's fast enough to preprocess 800K assets without becoming the bottleneck.

SC-VAE Training Protocol

Two-stage training, as described in Chapter 3:

Stage 1: Low-resolution data, direct reconstruction losses (MSE on vertices, BCE on flags, L1 on materials, KL divergence). Fast convergence to stabilize the latent space.
Stage 2: High-resolution data, add rendering-based perceptual losses (L1 + SSIM + LPIPS on rendered normals, depth, masks, and PBR attribute maps). Random camera placement with shallow near planes to capture internal structures.

Frozen vs. trained components: DINOv3-L backbone: frozen (pretrained on massive image data — retraining would be wasteful). SC-VAE encoder + decoder: trained from scratch. All three DiT generators: trained from scratch. RoPE embeddings: parameterless (computed from coordinates). The SC-VAE is trained first and frozen before DiT training begins — the DiTs learn to generate in a fixed latent space.

Rendering details for PBR evaluation: Split-sum renderer from nvdiffrec is used for PBR asset rendering. This properly handles metallic/roughness BRDF evaluation, giving physically accurate shading that tests whether generated materials are actually usable in production pipelines.

Why is the SC-VAE trained and frozen before DiT training begins?

Because the SC-VAE takes longer to converge
Because the DiT requires more GPU memory
Because the DiTs must learn to generate in a fixed, stable latent space; if the VAE were still changing, the DiTs' targets would be a moving target

Chapter 7: Results

TRELLIS.2 is evaluated on reconstruction fidelity, generation quality, and inference speed. It posts the best reconstruction and generation numbers among the methods compared below.

Reconstruction Quality

On the Toys4K benchmark (shape reconstruction):

Method	#Tokens	Downsample	MD ↓	F1 ↑	PSNR ↑
Dora (2K)	2.0K	—	366.1	0.019	22.02
TRELLIS	9.6K	4×	85.07	0.074	24.31
Direct3D-S2 1024	17K	8×	73.17	0.001	23.82
SparseFlex 1024	225K	4×	0.031	0.845	32.12
Ours 512	2.2K	16×	0.032	0.888	31.00
Ours 1024	9.6K	16×	0.004	0.971	35.26

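At 1024³, TRELLIS.2 achieves 8× lower mesh distance than SparseFlex (which uses 23× more tokens). Even at 512³ with only 2.2K tokens, it beats TRELLIS (9.6K tokens) and Direct3D-S2 (17K tokens) on all three metrics, and tops SparseFlex's F1 (0.888 vs 0.845) with about 100× fewer tokens, though SparseFlex keeps a PSNR edge (32.12 vs 31.00).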

Material reconstruction (our method only): No existing baseline can encode only materials given shapes, so TRELLIS.2 reports solo. PBR attribute PSNR: 38.89 dB / LPIPS: 0.033. Shaded image PSNR: 38.69 dB / LPIPS: 0.026. These are high-fidelity numbers — the materials faithfully reproduce the original under novel lighting.

Generation Quality

Image-to-3D generation compared against TRELLIS, Hi3DGen, Direct3D-S2, Step1X-3D, and Hunyuan3D 2.1:

CLIP alignment: Highest across all methods (0.894 CLIP, 0.477 CLIP-N)
Geometric similarity: Highest on ULIP-2 (0.758) and Uni3D (0.436)
User study: In a study with ~40 participants over 100 AI-generated prompts, TRELLIS.2 was preferred over all baselines in visual realism, geometric detail, and prompt alignment

Speed

Resolution	Shape Time	Texture Time	Total
512³	~2s	~1s	~3s
1024³	~10s	~7s	~17s
1536³	~35s	~25s	~60s

All timings on a single NVIDIA H100 GPU. These are significantly faster than existing large 3D generation models, despite producing higher quality outputs with PBR materials.

Speed and Quality

[Interactive chart: inference time by resolution (512³, 1024³, 1536³), with shape and texture breakdowns, showing how the compact latent space enables fast generation even at 1536³.]

Test-Time Scaling

The compact latent space enables two forms of test-time scaling:

Resolution scaling: Generate at 1024³, downsample the resulting O-Voxel into a sparse structure finer than the usual 64³ (e.g., 96³, since 1536 / 16 = 96), then re-apply Stage 2 to get 1536³ output. Finer details without retraining.
Compute scaling: Generate at 512³, downsample to 64³ structure, re-apply Stage 2 at 1024³. The cascaded inference corrects local errors and yields cleaner geometry. Adds ~3s for meaningfully finer details.

How does TRELLIS.2's mesh distance (MD) at 1024³ compare to SparseFlex at the same resolution?

TRELLIS.2 achieves ~8× lower MD (0.004 vs 0.031) despite using 23× fewer tokens (9.6K vs 225K)
They achieve roughly the same MD
SparseFlex has better MD because it uses more tokens

Chapter 8: Applications

TRELLIS.2's native 3D output with PBR materials opens several practical application paths that prior methods cannot support.

Image-to-3D Asset Generation

The primary use case: given a single image, generate a complete 3D asset ready for rendering. The output includes a mesh with proper topology (open surfaces, enclosed interiors supported) plus full PBR materials (base color, metallic, roughness, opacity). The asset is immediately usable in game engines, film pipelines, and AR applications without post-processing.

What the output looks like concretely: A 1024³ generated asset has ~600K triangles with UV-mapped PBR textures. It can be rendered in Unity, Unreal, or Blender with proper lighting response. Metal parts reflect environment maps. Rough surfaces scatter light diffusely. Glass is translucent. This is NOT a "textured mesh" with baked shading — it's a physically-based material that responds correctly to any lighting condition.

Shape-Conditioned Texture Generation

Stage 3 (material generation) can run independently. Given any input mesh and a reference image, it produces PBR materials aligned to the geometry. This beats alternatives:

Multi-view PBR methods (Hunyuan3D-Paint): Suffer from view inconsistencies — ghosting and blurred textures where views disagree.
UV-based methods (TEXGen): Suffer from ambiguous UV charts and seam artifacts.
TRELLIS.2 Stage 3: Reasons natively in 3D, producing sharp textures with consistent shape-material alignment. Can also texture internal surfaces — crucial for complex assets like vehicles with visible interiors.

Translucent and Complex Materials

The opacity channel α in O-Voxel enables translucent surface modeling — glass, ice, thin fabric. No prior 3D generation method handles this. The generated assets can be rendered with proper alpha blending in any standard pipeline.

Cascaded Super-Resolution

Because the latent space is resolution-agnostic (thanks to RoPE), a model trained at 1024³ can be applied to generate at 1536³ through cascaded inference. This requires no retraining — just re-running Stage 2 on a downsampled higher-res structure. The result: finer geometric details and sharper materials at the cost of additional inference time (~60s total for 1536³).

Integration with existing pipelines: O-Voxel's instant bidirectional conversion means TRELLIS.2 outputs are immediately compatible with any mesh-based workflow. Export as OBJ/FBX with PBR texture maps. The mesh has proper normals, UV coordinates, and material assignments. No Gaussian-to-mesh conversion, no NeRF baking, no post-processing pipeline. Generate → export → use.

What capability does the opacity channel in O-Voxel enable that no prior 3D generation method supports?

Higher resolution meshes
Faster inference speed
Translucent surface modeling (glass, ice, thin fabric) with proper alpha that can be rendered in standard pipelines

Chapter 9: Connections

TRELLIS.2 sits at the intersection of 3D representation learning, latent compression, and large-scale generative modeling. Let's map where it fits.

Relation to TRELLIS (v1)

TRELLIS introduced the Structured Latent (SLAT) representation for joint geometry-appearance modeling. But SLAT was built from multiview 2D features with rendering-based supervision, limiting its ability to capture complex structures and materials. TRELLIS.2 replaces this with native 3D data (O-Voxel), achieving 16× vs 4× compression, better quality, and proper PBR support. The generative pipeline (sparse DiT + flow matching) is evolutionary; the representation is revolutionary.

Relation to NeRF / 3D Gaussian Splatting

NeRFs and 3DGS represent appearance as view-dependent radiance — great for novel view synthesis, but the appearance is baked to specific lighting. O-Voxel stores intrinsic material properties (base color, metallic, roughness) that can be re-lit in any environment. The tradeoff: O-Voxel requires PBR-annotated training data, while NeRF/3DGS can learn from posed images alone.

Relation to DC-AE (Deep Compression Autoencoder)

DC-AE demonstrated that residual autoencoding enables extreme spatial compression in 2D image VAEs. TRELLIS.2 adapts this principle to sparse 3D voxels — the space-to-channel rearrangement and non-parametric shortcuts are directly inherited. The key extension: handling sparsity (most voxels are empty, children may not exist).

Relation to Flow Matching / Rectified Flow

The generation backbone uses flow matching rather than DDPM-style diffusion. Flow matching learns straight-line transport from noise to data, enabling fewer sampling steps. The DiT architecture follows the standard recipe (AdaLN, RoPE), with the novelty being its application to structured sparse 3D latents rather than 2D image tokens.

Cheat Sheet

Aspect	TRELLIS.2
Representation	O-Voxel (field-free sparse voxels with geometry + PBR)
Compression	SC-VAE, 16× spatial downsample, ~9.6K tokens @ 1024³
Generator	3 DiT models (~4B total), flow matching
Pipeline	Structure → Shape → Material (sequential)
Input	Single image (DINOv3-L features)
Output	Mesh + PBR textures (base color, metallic, roughness, alpha)
Speed (1024³)	~17s on H100 (~10s shape + ~7s texture)
Speed (512³)	~3s on H100
Training data	~800K 3D assets (Objaverse-XL, ABO, HSSD, TexVerse)
Training compute	32 H100 GPUs, batch 256, AdamW lr=1e-4
Key ablation	w/o Residual AE: MD +69%, PSNR -0.5dB
vs TRELLIS	16× vs 4× compression, native 3D vs multiview, PBR vs baked

The broader lesson: Representation is everything. A better 3D representation (O-Voxel) enables better compression (16× SC-VAE), which enables more efficient generation (fewer tokens for DiT to process), which enables higher resolution (1536³) and higher quality (PBR materials). Every downstream improvement traces back to getting the representation right.

What core 2D technique did TRELLIS.2 adapt for its sparse 3D compression?

Vision Transformer patch embedding
DC-AE's residual autoencoding: rearranging spatial information into channels before downsampling to preserve structure through extreme compression
U-Net skip connections from image segmentation

Native and Compact Structured Latents for 3D Generation