Xiang, Chen, Xu, Wang, Lv, Deng, Zhu, Dong, Zhao, Yuan, Yang — Tsinghua, MSR, USTC, Microsoft AI, 2025

Native and Compact Structured Latents for 3D Generation

Encode geometry AND PBR materials into sparse voxels, compress them 16× into ~9.6K latent tokens, then generate high-fidelity 3D assets with a 4B-parameter flow-matching model in under a minute.

Prerequisites: VAE basics + Sparse convolutions (intuition) + Flow matching / diffusion
10 chapters · 4 simulations

Chapter 0: The Problem

You want to generate a 3D asset from a single image. Not just a blobby shape — a detailed mesh with proper materials: shiny metal here, rough wood there, translucent glass in the middle. Something a game engine can render under any lighting.

The bottleneck is representation. Before you can generate anything, you need a way to encode 3D assets that neural networks can process. And every existing choice has painful limitations.

The SDF Trap

Most large 3D generation models use signed distance functions (SDFs) or occupancy fields to represent geometry. An SDF assigns every point in 3D space a number: negative inside the surface, positive outside, zero on the surface. You extract the mesh by finding the zero-crossing.
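To make the sign convention concrete, here is a minimal sketch using an analytic sphere. The function name `sphere_sdf` and the sample points are illustrative choices, not something taken from the paper.

```python
import math

def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(point, center) - radius

inside = sphere_sdf((0.0, 0.0, 0.0))      # center of the sphere
on_surface = sphere_sdf((1.0, 0.0, 0.0))  # exactly on the surface
outside = sphere_sdf((2.0, 0.0, 0.0))     # one unit beyond the surface

# A mesh extractor looks for grid edges whose endpoints have opposite signs.
edge_crosses_surface = inside * outside < 0
```

The sign test in the last line only works when "inside" is well defined, which is exactly what an open surface cannot provide.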

This sounds elegant, but it makes a fatal assumption: the surface must be watertight and manifold, so that every point has a clear "inside" and "outside." Real-world 3D assets violate this constantly, with open surfaces, T-junctions, and enclosed interiors.

The fundamental limitation: SDFs, Flexicubes, and all field-based representations require every surface to divide space into "inside" and "outside." Open surfaces, T-junctions, and enclosed interiors have no such division. Preprocessing these assets (closing holes, removing intersections) is lossy — you destroy the very geometry you wanted to capture.

The Appearance Gap

Even if geometry were solved, most 3D generation models ignore materials. They generate a shape and slap on a diffuse color texture. But real assets have physically-based rendering (PBR) attributes: metallic surfaces reflect light differently than plastic; rough surfaces scatter light; glass is translucent. Without PBR, generated assets look flat and fake under novel lighting.

TRELLIS (the predecessor) tried to handle appearance by baking multiview 2D features into 3D Gaussians. But this is rendering-dependent — the appearance is view-dependent, not intrinsic to the surface. You can't re-light a Gaussian splat the way you can a mesh with PBR materials.

Full data flow at a glance: Input image $I \in \mathbb{R}^{H \times W \times 3}$ → DINOv3-L features → Stage 1: predict sparse structure layout (which voxels are active) → Stage 2: generate geometry latents within active voxels → SC-VAE decoder → O-Voxel shape (dual vertices + edge flags) → Stage 3: generate material latents conditioned on geometry → SC-VAE decoder → O-Voxel materials (base color + metallic + roughness + alpha) → extract mesh + PBR textures. Total: ~17s for 1024³ on H100.
[Interactive figure: Topology Failures of SDF. SDFs require a clear inside/outside; the figure shows how open surfaces and non-manifold geometry break that assumption, with a red region marking where the SDF is undefined or produces artifacts.]

Why can't SDFs represent a sheet of paper (an open surface)?

Chapter 1: The Key Insight

TRELLIS.2's insight is deceptively simple: don't use fields at all. Instead, represent 3D assets directly as sparse voxels that store both geometry and materials — then compress those voxels into a compact latent space.

Step 1: Native 3D Encoding
Convert any mesh (open, non-manifold, enclosed) into O-Voxels: sparse voxels on an N×N×N grid. Each voxel stores dual vertex position + edge flags (geometry) and base color + metallic + roughness + alpha (materials). No SDF, no field. Direct mesh-to-voxel conversion in seconds.
Step 2: Compress 16×
A Sparse Compression VAE with residual autoencoding downsamples the voxel grid 16× spatially. A 1024³ asset with ~600K active voxels becomes ~9.6K latent tokens. That's 10× fewer tokens than original TRELLIS, and 23× fewer than SparseFlex at 1024³ (225K tokens).
Step 3: Generate in Latent Space
Three DiT-based flow-matching models (total ~4B parameters) generate structure, geometry, and materials sequentially. Image-conditioned. ~17s for a full 1024³ PBR asset on H100.
Why "field-free" matters: By skipping the SDF/occupancy field entirely, O-Voxel sidesteps all topology constraints. The mesh surface directly determines which voxels are active and where dual vertices go. No sign computation, no flood-fill, no watertight assumption. Open surfaces, T-junctions, enclosed interiors — all handled natively. The conversion is also instant: mesh → O-Voxel in seconds on CPU, O-Voxel → mesh in tens of milliseconds.
Why 16× and not 8× or 32×? Prior sparse voxel methods (TRELLIS, SparseFlex) achieve only 4× spatial downsampling. Their residual blocks can't compress further without severe quality loss. TRELLIS.2's residual autoencoding (adapted from DC-AE for 2D images) rearranges spatial information into channel dimensions before downsampling, enabling 16× compression with negligible perceptual degradation. At 32×, quality degrades even with residual autoencoding (MD rises 36% and PSNR drops 0.6dB relative to 16×), and without it MD balloons to about 5.3× the 32× baseline while PSNR drops a further 1.6dB. 16× is the sweet spot: compact enough for efficient generation, faithful enough for production quality.
What degrades without each component: Remove residual autoencoding → MD increases 69%, PSNR drops 0.5dB at 16×. Remove optimized ConvNeXt-style residual blocks → MD increases 16%, PSNR drops 0.6dB. Remove rendering loss in stage 2 training → material fidelity drops significantly (no perceptual supervision). Remove PBR attributes entirely → assets can't be re-lit, look flat under novel lighting.
What is the key architectural difference between TRELLIS.2 and its predecessor TRELLIS?

Chapter 2: O-Voxel Representation

O-Voxel ("omni-voxel") is the foundation of everything. It's a sparse voxel structure on an N×N×N grid where each active voxel stores a feature tuple:

$$f = \{(f^{\text{shape}}_i,\ f^{\text{mat}}_i,\ p_i)\}_{i=1}^{L}$$

where $p_i \in \{0, 1, \dots, N-1\}^3$ is the voxel coordinate, and only voxels intersecting the mesh surface are active. Let's unpack each component.

Shape: The Flexible Dual Grid

For geometry, each active voxel stores three things:

| Feature | Symbol | Shape | Meaning |
|---|---|---|---|
| Dual vertex | $v_i$ | $\mathbb{R}^3_{[0,1]}$ | A point within the voxel representing local surface position |
| Edge intersection flags | $\delta_i$ | $\{0,1\}^3$ | Which of the 3 canonical edges (X, Y, Z) the mesh intersects |
| Splitting weights | $\gamma_i$ | $\mathbb{R}_{>0}$ | Controls how quad faces split into triangles |

The idea comes from Dual Contouring (DC), but with a crucial difference: DC requires a signed distance field to detect sign changes across edges. O-Voxel skips the field entirely. It directly tests whether each voxel edge intersects the mesh surface. If an edge crosses a triangle, the adjacent dual face is activated, and Hermite data (intersection point + normal) positions the dual vertex.
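The direct edge test can be sketched with a standard segment/triangle intersection (the Möller-Trumbore construction restricted to a finite segment). This is an illustration under assumptions: the helper names, the unit voxel at the origin, and the single test triangle are all hypothetical, and the paper's rasterization-based implementation is likely organized differently.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def edge_hits_triangle(p0, p1, tri, eps=1e-9):
    """Return the point where segment p0->p1 crosses triangle `tri`, or None."""
    v0, v1, v2 = tri
    d = sub(p1, p0)
    e1, e2 = sub(v1, v0), sub(v2, v0)
    h = cross(d, e2)
    det = dot(e1, h)
    if abs(det) < eps:
        return None                      # segment is parallel to the triangle plane
    inv = 1.0 / det
    s = sub(p0, v0)
    u = dot(s, h) * inv                  # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(d, q) * inv                  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv                 # position along the segment, 0..1
    if t < 0.0 or t > 1.0:
        return None
    return tuple(p + t * di for p, di in zip(p0, d))

# A horizontal triangle at z = 0.5 crossing a unit voxel whose min corner is the origin.
triangle = ((-1.0, -1.0, 0.5), (2.0, -1.0, 0.5), (-1.0, 2.0, 0.5))
corner = (0.0, 0.0, 0.0)
axis_ends = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))  # X, Y, Z edge endpoints
hits = [edge_hits_triangle(corner, end, triangle) for end in axis_ends]
edge_flags = tuple(int(h is not None) for h in hits)
```

Only the Z edge crosses the triangle here, so the flags come out as (0, 0, 1). No sign of any field is ever evaluated, which is why the same test works for open or non-manifold meshes.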

The dual vertex position is computed by minimizing a Quadratic Error Function (QEF):

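$$\min_{v \in \text{voxel}} e(v) = \sum_i d_{\Pi,i}^2 + \lambda_{\text{bound}} \sum_j d_{L,j}^2 + \lambda_{\text{reg}}\, d^2$$

Three terms, each serving a purpose: the first sums squared distances $d_{\Pi,i}$ to the planes defined by the Hermite data, the second (weighted by $\lambda_{\text{bound}}$) sums squared distances $d_{L,j}$ for the boundary term, and the third (weighted by $\lambda_{\text{reg}}$) is a regularizer.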

Mesh → O-Voxel (instant, CPU-only): (1) Rasterize mesh triangles onto the N³ grid to find all voxel edges that intersect the surface. (2) Mark neighboring voxels as active. (3) Compute Hermite data (intersection points + normals) analytically from triangle geometry. (4) Solve QEF (Eq. above) in closed form for each active voxel. Total time: a few seconds on a single CPU for 1024³. No optimization loop, no GPU needed.
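The closed-form solve in step (4) can be sketched as a small regularized least-squares problem. This is a simplified version under stated assumptions: it keeps only the plane-distance term and a regularizer that pulls toward the mean intersection point, it omits the boundary term entirely, and it approximates the in-voxel constraint by clamping. The function name `solve_qef` and the default weight are hypothetical.

```python
import numpy as np

def solve_qef(points, normals, lam_reg=1e-6, voxel_min=0.0, voxel_max=1.0):
    """Minimize sum_i (n_i . (v - p_i))^2 + lam_reg * |v - mean(p)|^2 in closed form."""
    p = np.asarray(points, dtype=float)         # (k, 3) edge/surface intersection points
    n = np.asarray(normals, dtype=float)        # (k, 3) surface normals at those points
    n = n / np.linalg.norm(n, axis=1, keepdims=True)
    offsets = np.einsum("ij,ij->i", n, p)       # n_i . p_i for each Hermite plane
    A = n.T @ n + lam_reg * np.eye(3)           # normal equations, made well-posed by lam_reg
    b = n.T @ offsets + lam_reg * p.mean(axis=0)
    v = np.linalg.solve(A, b)
    return np.clip(v, voxel_min, voxel_max)     # assumption: clamp instead of a constrained solve

# Three axis-aligned planes that meet at the corner (0.3, 0.6, 0.2) inside a unit voxel.
points = [(0.3, 0.9, 0.9), (0.9, 0.6, 0.9), (0.9, 0.9, 0.2)]
normals = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
dual_vertex = solve_qef(points, normals)
```

With three independent planes the minimizer lands on their common corner, which is how a dual vertex preserves sharp features. The regularizer matters when the planes are nearly parallel and the system would otherwise be singular.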

Material: Volumetric Surface Attributes

Each active voxel also stores PBR material attributes:

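$$f^{\text{mat}}_i = (c_i,\ m_i,\ r_i,\ \alpha_i)$$

| Attribute | Symbol | Range | Meaning |
|---|---|---|---|
| Base color | $c_i$ | $\mathbb{R}^3_{[0,1]}$ | RGB albedo (diffuse + specular base) |
| Metallic | $m_i$ | $\mathbb{R}_{[0,1]}$ | 0 = dielectric (plastic, wood), 1 = metal |
| Roughness | $r_i$ | $\mathbb{R}_{[0,1]}$ | 0 = mirror-smooth, 1 = fully diffuse |
| Opacity | $\alpha_i$ | $\mathbb{R}_{[0,1]}$ | 0 = fully transparent, 1 = opaque |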

Materials are sampled by projecting each voxel center onto intersected triangles and reading from the texture map using UV coordinates. The inverse is equally simple: to reconstruct a material at any surface point, trilinear interpolation of neighboring voxel attributes gives the answer. No baking, no optimization.
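As a rough illustration of the reconstruction direction, the sketch below does plain trilinear interpolation on a small dense array. It assumes a dense `(N, N, N, C)` grid with samples at integer coordinates; a real O-Voxel is sparse, so inactive neighbors would have to be handled somehow (for example by renormalizing weights), which this sketch does not attempt. The function name is hypothetical.

```python
import numpy as np

def trilinear_sample(grid, point):
    """Interpolate per-voxel attributes at a continuous grid coordinate."""
    g = np.asarray(grid, dtype=float)                  # (N, N, N, C) attribute grid
    p = np.asarray(point, dtype=float)                 # (3,) continuous coordinate
    lo = np.clip(np.floor(p).astype(int), 0, np.array(g.shape[:3]) - 2)
    w = p - lo                                         # fractional offset inside the cell
    out = np.zeros(g.shape[3])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                weight = ((w[0] if dx else 1 - w[0]) *
                          (w[1] if dy else 1 - w[1]) *
                          (w[2] if dz else 1 - w[2]))
                out += weight * g[lo[0] + dx, lo[1] + dy, lo[2] + dz]
    return out

# A 2x2x2 grid holding a single channel (say roughness) that ramps from 0 to 1 along x.
roughness = np.zeros((2, 2, 2, 1))
roughness[1, :, :, 0] = 1.0
sample = trilinear_sample(roughness, (0.25, 0.5, 0.5))
```

The sample a quarter of the way along x reads back a roughness of 0.25, as expected for a linear ramp.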

Per-voxel feature dimensionality: Shape features: 3 (dual vertex) + 3 (edge flags) + 1 (splitting weight) = 7. Material features: 3 (base color) + 1 (metallic) + 1 (roughness) + 1 (opacity) = 6. Total: 13 channels per active voxel. For a 1024³ asset with ~600K active voxels, that's ~600K × 13 ≈ 7.8M floats before compression. After 16× SC-VAE compression: ~9.6K tokens × 32 channels = ~307K floats. A 25× reduction in raw data.
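The bookkeeping above is easy to check directly. The snippet just restates the numbers quoted in this chapter (approximate voxel and token counts included); the variable names are illustrative.

```python
shape_channels = 3 + 3 + 1          # dual vertex + edge flags + splitting weight
material_channels = 3 + 1 + 1 + 1   # base color + metallic + roughness + opacity
channels = shape_channels + material_channels

active_voxels = 600_000             # approximate count for a 1024^3 asset
latent_tokens = 9_600               # approximate token count after 16x compression
latent_channels = 32

raw_floats = active_voxels * channels
latent_floats = latent_tokens * latent_channels
reduction = raw_floats / latent_floats
print(channels, raw_floats, latent_floats, round(reduction, 1))
```

Most of the reduction comes from the token count (600K voxels down to 9.6K tokens); the per-token channel width actually grows from 13 to 32.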
What is the key advantage of O-Voxel's "field-free" design over traditional Dual Contouring?

Chapter 3: Sparse Compression VAE

O-Voxel gives us a native 3D representation. But a 1024³ grid with ~600K active voxels is too large for a generative model to work with directly. We need to compress it into a compact latent space — that's the job of the Sparse Compression VAE (SC-VAE).

Architecture: Fully Sparse-Convolutional

Unlike TRELLIS (which used transformers), SC-VAE is a fully sparse-convolutional U-shaped network. It processes only active voxels, skipping the ~99.94% of the grid that's empty (about 600K active cells out of 1024³ ≈ 1.07B). The encoder downsamples hierarchically through residual blocks; the decoder mirrors this for reconstruction.

The Residual Autoencoding Trick

The key innovation enabling 16× compression is Sparse Residual Autoencoding, adapted from DC-AE (originally designed for 2D images). The problem: at 16× downsampling, you're squeezing 16³ = 4096 spatial voxels into a single latent. Standard pooling destroys too much information.

The solution: before each 2× downsample, rearrange the 8 child voxels of each parent into the channel dimension:

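$$F^{\text{coarse}}_{\text{raw}} = \operatorname{stack}(F_{\text{child}_1}, \dots, F_{\text{child}_8}) \in \mathbb{R}^{8C}$$
$$F^{\text{coarse}} = \operatorname{avg\_groups}(F^{\text{coarse}}_{\text{raw}}) \in \mathbb{R}^{C'}$$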

This is a non-parametric shortcut: spatial information is preserved in channels, then averaged across groups. The learnable conv layers only need to refine a residual on top of this estimate. During upsampling, the symmetric operation distributes channels back to spatial positions:

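$$F^{\text{fine}}_{\text{raw}} = \operatorname{unstack}(F^{\text{coarse}}) \in \mathbb{R}^{8 \times C/8}$$
$$F^{\text{fine}} = \operatorname{dup\_groups}(F^{\text{fine}}_{\text{raw}}) \in \mathbb{R}^{C}$$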
Why this works: Without residual shortcuts, the convolution layers must learn to both compress spatial structure AND preserve detail. That's too much to ask at 16×. The residual shortcut handles coarse structure preservation for free (non-parametric averaging), letting the convolutions focus on fine detail refinement. Ablation: removing residual AE at 16× causes MD to increase 69% and PSNR to drop 0.5dB. At 32×, removing it is catastrophic (MD jumps from 1.405 to 7.394, about 5.3×), and even with the shortcut 32× reconstructs worse than 16× (MD 1.405 vs. 1.032).
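The non-parametric shortcut can be sketched on a small dense grid with NumPy. This is a stand-in under assumptions: the real operation runs on sparse voxels where some of the 8 children may be missing, the child ordering and channel grouping here are arbitrary choices, and all function names are hypothetical.

```python
import numpy as np

def space_to_channel(fine):
    """(D, D, D, C) -> (D/2, D/2, D/2, 8C): stack each parent's 8 children into channels."""
    D, _, _, C = fine.shape
    x = fine.reshape(D // 2, 2, D // 2, 2, D // 2, 2, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(D // 2, D // 2, D // 2, 8 * C)

def avg_groups(x, out_channels):
    """Average consecutive channel groups to reach out_channels."""
    group = x.shape[-1] // out_channels
    return x.reshape(*x.shape[:-1], out_channels, group).mean(axis=-1)

def channel_to_space(coarse):
    """(d, d, d, 8c) -> (2d, 2d, 2d, c): hand one channel slice back to each child."""
    d, _, _, c8 = coarse.shape
    c = c8 // 8
    x = coarse.reshape(d, d, d, 2, 2, 2, c)
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)
    return x.reshape(2 * d, 2 * d, 2 * d, c)

def dup_groups(x, out_channels):
    """Repeat each channel to reach out_channels."""
    return np.repeat(x, out_channels // x.shape[-1], axis=-1)

rng = np.random.default_rng(0)
fine = rng.normal(size=(4, 4, 4, 2))                   # C = 2
down = avg_groups(space_to_channel(fine), 4)           # shortcut for a 2x downsample, C' = 4
up = dup_groups(channel_to_space(rng.normal(size=(2, 2, 2, 16))), 4)
roundtrip = channel_to_space(space_to_channel(fine))   # rearrangement alone is lossless
```

The rearrangement itself loses nothing (`roundtrip` equals `fine`); information is only discarded in the group averaging, and that gap is what the learned convolutions are left to model as a residual.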

Optimized ConvNeXt-style Blocks

Sparse convolutions are computationally expensive at high sparsity. SC-VAE replaces the standard two-conv residual block with a ConvNeXt-inspired design: one sparse conv layer followed by a wide point-wise MLP (analogous to a Transformer FFN). This doesn't change runtime but improves reconstruction quality: MD drops 16%, PSNR gains 0.6dB.

Early-Pruning Upsampler

During decoding, not all child voxels of a parent should be active. Before each upsample step, the network predicts a binary mask ρ̂ ∈ {0,1}⁸ for each parent, specifying which children to activate. Inactive children are pruned, saving both computation and memory.
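A minimal sketch of how such a mask turns parent coordinates into child coordinates is shown below. The bit-to-offset ordering of the 8 children is an assumption made for illustration, and in the actual decoder the masks would come from a learned prediction head rather than being hand-written.

```python
def upsample_with_pruning(parents, masks):
    """Expand each parent voxel into only the children its 8-bit mask keeps.

    parents: list of (x, y, z) integer coordinates at the coarse level.
    masks:   list of 8-tuples of 0/1, one per parent (assumed ordering: bit k
             encodes the child offset (dx, dy, dz) as the binary digits of k).
    """
    children = []
    for (x, y, z), mask in zip(parents, masks):
        for k, keep in enumerate(mask):
            if keep:
                dx, dy, dz = (k >> 2) & 1, (k >> 1) & 1, k & 1
                children.append((2 * x + dx, 2 * y + dy, 2 * z + dz))
    return children

parents = [(0, 0, 0), (1, 0, 2)]
masks = [(1, 0, 0, 0, 0, 0, 0, 1), (0, 0, 0, 0, 1, 0, 0, 0)]
children = upsample_with_pruning(parents, masks)
```

Two parents could have produced 16 children; the masks keep only 3, so every later layer touches 3 voxels instead of 16.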

Concrete compression numbers: Input: 1024³ O-Voxel with ~600K active voxels × 13 channels. After 16× spatial downsampling (4 stages of 2×): ~9.6K latent tokens × 32 channels = ~307K floats. Spatial resolution shrinks from 1024³ to 64³. Compare: TRELLIS at 4× downsampling produces ~20K tokens. Direct3D-S2 at 8× produces ~17K tokens at 1024³. SparseFlex at 4× produces 225K tokens at 1024³. TRELLIS.2 is 23× more compact than SparseFlex.
[Interactive figure: Compression Pipeline. A 1024³ voxel grid compresses through 4 stages of 2× downsampling; each stage halves spatial resolution while doubling channels, and the residual shortcut preserves structure at each step.]

Two-Stage Training

SC-VAE is trained in two stages:

  1. Stage 1 (low-resolution): Direct O-Voxel reconstruction losses. MSE on dual vertex positions, BCE on edge flags and pruning masks, L1 on materials, plus KL divergence.
  2. Stage 2 (high-resolution): Add rendering-based perceptual supervision. Render mask, depth, normal, and material maps from random camera positions. Supervise with L1 + SSIM + LPIPS. Cameras are placed with shallow near planes to slice through the surface, forcing the model to capture internal structures too.
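The Stage 1 objective can be sketched as a plain sum of the listed terms. Everything about the weighting is an assumption: the text does not give loss weights, so the terms are summed with weight 1 except for a hypothetical `kl_weight`, and the dictionary keys are illustrative names.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Numerically stable binary cross-entropy on raw logits."""
    return np.mean(np.maximum(logits, 0) - logits * targets
                   + np.log1p(np.exp(-np.abs(logits))))

def kl_to_standard_normal(mu, logvar):
    """KL divergence from N(mu, exp(logvar)) to N(0, I), averaged over tokens."""
    return 0.5 * np.mean(np.sum(mu ** 2 + np.exp(logvar) - 1.0 - logvar, axis=-1))

def stage1_loss(pred, target, mu, logvar, kl_weight=1e-3):
    vertex = np.mean((pred["vertex"] - target["vertex"]) ** 2)        # MSE on dual vertices
    flags = bce_with_logits(pred["flag_logits"], target["flags"])     # BCE on edge flags
    prune = bce_with_logits(pred["mask_logits"], target["masks"])     # BCE on pruning masks
    material = np.mean(np.abs(pred["material"] - target["material"]))  # L1 on materials
    return vertex + flags + prune + material + kl_weight * kl_to_standard_normal(mu, logvar)

target = {
    "vertex": np.array([[0.5, 0.5, 0.5]]),
    "flags": np.array([[1.0, 0.0, 1.0]]),
    "masks": np.array([[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]),
    "material": np.array([[0.8, 0.2, 0.1, 0.0, 0.5, 1.0]]),
}
confident = lambda bits: 20.0 * (2.0 * bits - 1.0)   # large logits matching the targets
perfect = {
    "vertex": target["vertex"],
    "flag_logits": confident(target["flags"]),
    "mask_logits": confident(target["masks"]),
    "material": target["material"],
}
mu, logvar = np.zeros((1, 32)), np.zeros((1, 32))
loss_perfect = stage1_loss(perfect, target, mu, logvar)
loss_shifted = stage1_loss(dict(perfect, vertex=target["vertex"] + 0.1), target, mu, logvar)
```

A perfect prediction drives the loss to essentially zero, and shifting every dual vertex coordinate by 0.1 adds 0.01 through the MSE term alone.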
Decoupled latent spaces: Shape and material are encoded by separate SC-VAEs. The shape VAE encodes geometry features alone. The material VAE encodes material features, conditioned on the shape VAE's subdivision structures during upsampling. This decoupling enables sequential generation: generate shape first, then generate materials conditioned on that shape. It also allows texture-only generation for given shapes.
What is the purpose of the Sparse Residual Autoencoding shortcut?

Chapter 4: Latent Space (Showcase)

Let's put the compression in perspective. How does TRELLIS.2's latent space compare to every other 3D generation method?

The Token Efficiency Frontier

Every 3D latent representation faces a tradeoff: more tokens give better reconstruction quality, but make generation slower and harder. The ideal method sits in the top-left corner: few tokens, high quality.

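| Method | Resolution | Downsample | #Tokens | #Dims (total) | Decode Time |
|---|---|---|---|---|---|
| Dora | - | - | 2.0K | 131K | 37.7s |
| TRELLIS | - | - | 9.6K | 77K | 0.108s |
| Direct3D-S2 512 | 512³ | - | 3.0K | 48K | 1.86s |
| Direct3D-S2 1024 | 1024³ | - | 17K | 271K | 13.0s |
| SparseFlex 512 | 512³ | - | 56K | 452K | - |
| SparseFlex 1024 | 1024³ | - | 225K | 1799K | - |
| Ours 512 | 512³ | 16× | 2.2K | 70K | - |
| Ours 1024 | 1024³ | 16× | 9.6K | 306K | - |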

At 512³, TRELLIS.2 uses only 2.2K tokens — 25× fewer than SparseFlex. At 1024³, it uses 9.6K tokens — 23× fewer than SparseFlex and nearly 2× fewer than Direct3D-S2. Yet its reconstruction quality is substantially better across every metric.
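The ratios follow directly from the token counts in the table. The last line adds one derived observation that is not from the paper: for dense self-attention, the number of token pairs grows with the square of the token count, so a 23× token saving would correspond to a much larger saving in pairwise interactions. Whether the baselines actually use dense attention is not stated here, so treat that figure as illustrative.

```python
tokens = {
    "Direct3D-S2 1024": 17_000,
    "SparseFlex 512": 56_000,
    "SparseFlex 1024": 225_000,
    "Ours 512": 2_200,
    "Ours 1024": 9_600,
}

ratio_512 = tokens["SparseFlex 512"] / tokens["Ours 512"]
ratio_1024 = tokens["SparseFlex 1024"] / tokens["Ours 1024"]
ratio_direct3d = tokens["Direct3D-S2 1024"] / tokens["Ours 1024"]

# Dense self-attention compares every token with every other token.
attention_pair_ratio = ratio_1024 ** 2
print(round(ratio_512), round(ratio_1024), round(ratio_direct3d, 2), round(attention_pair_ratio))
```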

Why fewer tokens AND better quality? The answer is O-Voxel + Residual Autoencoding. O-Voxel stores geometry and materials directly (no redundant field evaluation), so the raw data is already compact. The residual shortcut preserves most of this information through aggressive downsampling. Other methods start from less efficient representations (full grids, point clouds) and use weaker compression, so they need more tokens to achieve less.
[Interactive figure: Token Efficiency Comparison. Bar chart comparing token counts and reconstruction quality (Normal PSNR) across methods, with a toggle between 512³ and 1024³.]


What Happens at Different Compression Ratios

The ablation on Sketchfab assets at 256³ tells the full story:

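| Setting | Downsample | #Tokens | MD ↓ | PSNR ↑ |
|---|---|---|---|---|
| SC-VAE f16c32 | 16× | 503 | 1.032 | 27.26 |
| w/o Residual AE | 16× | 503 | 1.747 (+69%) | 26.73 |
| w/o Opt. ResBlock | 16× | 503 | 1.198 (+16%) | 26.67 |
| SC-VAE f32c128 | 32× | 118 | 1.405 | 26.65 |
| w/o Residual AE (32×) | 32× | 118 | 7.394 (+426% vs. f32c128) | 25.01 |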

At 16×, residual AE keeps quality high. At 32×, even with residual AE, quality degrades. Without it, 32× is catastrophic. The 16× sweet spot delivers production-quality reconstruction with a manageable token count.
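The relative changes can be recomputed from the tabulated MD and PSNR values. The helper below is a small illustrative calculation (the dictionary keys are shortened labels, not official setting names); each variant is compared against the baseline at the same downsampling factor, and 32× is also compared against 16×.

```python
ablation = {
    "f16c32": (1.032, 27.26),
    "f16c32 w/o residual AE": (1.747, 26.73),
    "f16c32 w/o opt. resblock": (1.198, 26.67),
    "f32c128": (1.405, 26.65),
    "f32c128 w/o residual AE": (7.394, 25.01),
}

def relative_change(name, baseline):
    """Return (MD change in percent, PSNR change in dB) of `name` versus `baseline`."""
    md, psnr = ablation[name]
    base_md, base_psnr = ablation[baseline]
    return round(100 * (md / base_md - 1)), round(psnr - base_psnr, 2)

no_residual_16 = relative_change("f16c32 w/o residual AE", "f16c32")
no_resblock_16 = relative_change("f16c32 w/o opt. resblock", "f16c32")
going_to_32 = relative_change("f32c128", "f16c32")
no_residual_32 = relative_change("f32c128 w/o residual AE", "f32c128")
```

Read this way, moving from 16× to 32× with the shortcut in place costs roughly a third more mesh distance, while dropping the shortcut at 32× multiplies mesh distance by about 5.3.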

At 1024³ resolution, how does TRELLIS.2's token count compare to SparseFlex?

Chapter 5: Flow Matching for 3D

With the latent space defined, we need a generative model to sample from it. TRELLIS.2 uses flow matching — a cleaner alternative to diffusion that learns straight-line transport from noise to data.
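The straight-line idea fits in a few lines of NumPy. This is a generic flow-matching sketch, not the paper's training code: the time convention (t = 0 is noise, t = 1 is data), the Euler sampler, and the step count are assumptions, and the "model" here is an oracle velocity field for a single known data point rather than a DiT.

```python
import numpy as np

def flow_matching_pair(noise, data, t):
    """Training pair: a point on the straight path and the velocity the model should predict."""
    x_t = (1.0 - t) * noise + t * data
    target_velocity = data - noise          # constant along a straight line
    return x_t, target_velocity

def euler_sample(velocity_fn, noise, steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 (noise) to t = 1 (data)."""
    x = noise.copy()
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

rng = np.random.default_rng(0)
data = np.array([1.0, 2.0, 3.0])            # stand-in for one latent token
noise = rng.normal(size=3)

x_half, v_target = flow_matching_pair(noise, data, t=0.5)

# Oracle velocity for a dataset containing only `data`; a trained network replaces this.
oracle = lambda x, t: (data - x) / (1.0 - t)
sample = euler_sample(oracle, noise)
```

Because the path is a straight line, the oracle velocity is constant along it and a handful of Euler steps recover the data point; this is the intuition behind needing fewer sampling steps than curved diffusion trajectories.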

Three-Stage Generation Pipeline

Generation unfolds in three sequential stages, each with its own DiT (Diffusion Transformer):

Stage 1: Sparse Structure
Predict which voxels are active (the occupancy layout). Input: image features from DINOv3-L. Output: binary occupancy grid at 64³ (the latent resolution). This is a classification problem — a lightweight model.
Stage 2: Geometry Generation
Given the active voxel layout, generate geometry latents within active voxels. A sparse DiT takes the layout + image features and produces shape latent tokens via flow matching. SC-VAE shape decoder converts to O-Voxel geometry (dual vertices + edge flags).
Stage 3: Material Generation
Given the generated geometry, generate material latents. A sparse DiT conditioned on image features + geometry latents produces material tokens. SC-VAE material decoder converts to PBR attributes (base color, metallic, roughness, alpha). This is the novel stage — native 3D PBR generation.

DiT Architecture Details

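Each DiT module follows the standard recipe: AdaLN conditioning and RoPE positional embeddings computed from voxel coordinates.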

Architecture scaling: Each DiT has ~1.3B parameters (width: 1536, blocks: 30, heads: 12, MLP width: 8192). Three DiTs total ~3.9B parameters, plus the SC-VAEs and conditioning encoder, reaching ~4B total. Critically, the compact latent space (9.6K tokens at 1024³) means these DiTs process far fewer tokens than if the latent space were larger. A vanilla DiT architecture suffices — no need for the convolutional packing or skip connections that TRELLIS required to handle its 20K tokens efficiently.
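As a sanity check on the ~1.3B figure, here is a back-of-the-envelope weight count. The block layout is an assumption, not something stated above: each block is taken to contain one self-attention layer, one cross-attention layer for image conditioning, and a two-layer MLP, while biases, norms, AdaLN modulation, and input/output embeddings are ignored.

```python
width, blocks, mlp_width = 1536, 30, 8192

self_attn = 4 * width * width      # Q, K, V, and output projections
cross_attn = 4 * width * width     # assumed cross-attention to image features
mlp = 2 * width * mlp_width        # up-projection and down-projection

per_block = self_attn + cross_attn + mlp
total = blocks * per_block
print(per_block, total, round(total / 1e9, 2))
```

Under these assumptions the count comes out near 1.32B per DiT, in line with the quoted ~1.3B. A different block layout would shift the estimate, so this is only a plausibility check.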
Why separate shape and texture? Two reasons. First, shape structure must be known before materials can be spatially aligned to it. Second, the decoupled pipeline enables shape-conditioned texture generation as a standalone tool: given any 3D mesh and a reference image, Stage 3 alone can synthesize PBR materials. This is impossible with joint generation.

Progressive Resolution Training

The models are trained progressively:

  1. Stage 1 (sparse structure) trains with 512×512 conditioning images to learn coarse occupancy priors.
  2. Stages 2-3 start at 512³ output (32³ latent) and scale to 1024³ output (64³ latent), with conditioning images increasing to 1024×1024.

This progressive strategy transfers learned priors across resolutions, enabling efficient training of the large sparse DiTs while maintaining fidelity.

Why does TRELLIS.2 generate shape and materials in separate stages instead of jointly?

Chapter 6: Training

Training a 4B-parameter 3D generation system requires careful data curation, infrastructure, and a multi-stage training protocol.

Training Data

| Component | Dataset | Size | Purpose |
|---|---|---|---|
| SC-VAE | Trellis-500K (filtered) | ~500K assets | Curated from Objaverse-XL, ABO, HSSD. Only assets with PBR materials kept. |
| DiT generators | Extended collection | ~800K assets | Augmented with TexVerse for PBR diversity and realism |
| Image prompts | Rendered views | 16 views/asset | Rendered in Blender with randomized FoVs and lighting |
| Evaluation (reconstruction) | Toys4K + Sketchfab Featured | ~4090 + 90 | Unseen during training. Sketchfab: complex PBR and detailed shapes |
| Evaluation (generation) | AI-generated prompts | 100 images | Ensures train-test disjointness |

Training Infrastructure

Data preprocessing pipeline: For each 3D asset: (1) Convert mesh to O-Voxel at target resolution (seconds on CPU). (2) Run SC-VAE encoder to get latent tokens. (3) Render 16 views in Blender with randomized camera FoVs and HDR lighting. (4) Extract DINOv3-L features from rendered views. The O-Voxel conversion is the key enabler — it's fast enough to preprocess 800K assets without becoming the bottleneck.

SC-VAE Training Protocol

Two-stage training, as described in Chapter 3.

Frozen vs. trained components: DINOv3-L backbone: frozen (pretrained on massive image data — retraining would be wasteful). SC-VAE encoder + decoder: trained from scratch. All three DiT generators: trained from scratch. RoPE embeddings: parameterless (computed from coordinates). The SC-VAE is trained first and frozen before DiT training begins — the DiTs learn to generate in a fixed latent space.
Rendering details for PBR evaluation: Split-sum renderer from nvdiffrec is used for PBR asset rendering. This properly handles metallic/roughness BRDF evaluation, giving physically accurate shading that tests whether generated materials are actually usable in production pipelines.
Why is the SC-VAE trained and frozen before DiT training begins?

Chapter 7: Results

TRELLIS.2 is evaluated on reconstruction fidelity, generation quality, and inference speed. The results are consistently dominant.

Reconstruction Quality

On the Toys4K benchmark (shape reconstruction):

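| Method | #Tokens | Downsample | MD ↓ | F1 ↑ | PSNR ↑ |
|---|---|---|---|---|---|
| Dora (2K) | 2.0K | - | 366.1 | 0.019 | 22.02 |
| TRELLIS | 9.6K | - | 85.07 | 0.074 | 24.31 |
| Direct3D-S2 1024 | 17K | - | 73.17 | 0.001 | 23.82 |
| SparseFlex 1024 | 225K | - | 0.313 | 0.845 | 32.12 |
| Ours 512 | 2.2K | 16× | 0.032 | 0.888 | 31.00 |
| Ours 1024 | 9.6K | 16× | 0.004 | 0.971 | 35.26 |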

At 1024³, TRELLIS.2 reaches a mesh distance of 0.004 versus 0.313 for SparseFlex, which uses 23× more tokens. Even at 512³ with only 2.2K tokens, it beats every baseline on MD and F1 and trails only SparseFlex 1024 (about 100× more tokens) on PSNR.

Material reconstruction (our method only): No existing baseline can encode only materials given shapes, so TRELLIS.2 reports solo. PBR attribute PSNR: 38.89 dB / LPIPS: 0.033. Shaded image PSNR: 38.69 dB / LPIPS: 0.026. These are high-fidelity numbers — the materials faithfully reproduce the original under novel lighting.

Generation Quality

Image-to-3D generation is compared against TRELLIS, Hi3DGen, Direct3D-S2, Step1X-3D, and Hunyuan3D 2.1.

Speed

| Resolution | Shape Time | Texture Time | Total |
|---|---|---|---|
| 512³ | ~2s | ~1s | ~3s |
| 1024³ | ~10s | ~7s | ~17s |
| 1536³ | ~35s | ~25s | ~60s |

All timings on a single NVIDIA H100 GPU. These are significantly faster than existing large 3D generation models, despite producing higher quality outputs with PBR materials.

[Interactive figure: Speed and Quality. Inference speed across resolutions, showing how the compact latent space enables fast generation even at 1536³.]

Test-Time Scaling

The compact latent space also leaves headroom for test-time scaling, such as the cascaded 1536³ inference described in Chapter 8.

How does TRELLIS.2's mesh distance (MD) at 1024³ compare to SparseFlex at the same resolution?

Chapter 8: Applications

TRELLIS.2's native 3D output with PBR materials opens several practical application paths that prior methods cannot support.

Image-to-3D Asset Generation

The primary use case: given a single image, generate a complete 3D asset ready for rendering. The output includes a mesh with proper topology (open surfaces, enclosed interiors supported) plus full PBR materials (base color, metallic, roughness, opacity). The asset is immediately usable in game engines, film pipelines, and AR applications without post-processing.

What the output looks like concretely: A 1024³ generated asset has ~600K triangles with UV-mapped PBR textures. It can be rendered in Unity, Unreal, or Blender with proper lighting response. Metal parts reflect environment maps. Rough surfaces scatter light diffusely. Glass is translucent. This is NOT a "textured mesh" with baked shading — it's a physically-based material that responds correctly to any lighting condition.

Shape-Conditioned Texture Generation

Stage 3 (material generation) can run independently. Given any input mesh and a reference image, it produces PBR materials aligned to the geometry.

Translucent and Complex Materials

The opacity channel α in O-Voxel enables translucent surface modeling — glass, ice, thin fabric. No prior 3D generation method handles this. The generated assets can be rendered with proper alpha blending in any standard pipeline.

Cascaded Super-Resolution

Because the latent space is resolution-agnostic (thanks to RoPE), a model trained at 1024³ can be applied to generate at 1536³ through cascaded inference. This requires no retraining — just re-running Stage 2 on a downsampled higher-res structure. The result: finer geometric details and sharper materials at the cost of additional inference time (~60s total for 1536³).

Integration with existing pipelines: O-Voxel's instant bidirectional conversion means TRELLIS.2 outputs are immediately compatible with any mesh-based workflow. Export as OBJ/FBX with PBR texture maps. The mesh has proper normals, UV coordinates, and material assignments. No Gaussian-to-mesh conversion, no NeRF baking, no post-processing pipeline. Generate → export → use.
What capability does the opacity channel in O-Voxel enable that no prior 3D generation method supports?

Chapter 9: Connections

TRELLIS.2 sits at the intersection of 3D representation learning, latent compression, and large-scale generative modeling. Let's map where it fits.

Relation to TRELLIS (v1)

TRELLIS introduced the Structured Latent (SLAT) representation for joint geometry-appearance modeling. But SLAT was built from multiview 2D features with rendering-based supervision, limiting its ability to capture complex structures and materials. TRELLIS.2 replaces this with native 3D data (O-Voxel), achieving 16× vs 4× compression, better quality, and proper PBR support. The generative pipeline (sparse DiT + flow matching) is evolutionary; the representation is revolutionary.

Relation to NeRF / 3D Gaussian Splatting

NeRFs and 3DGS represent appearance as view-dependent radiance — great for novel view synthesis, but the appearance is baked to specific lighting. O-Voxel stores intrinsic material properties (base color, metallic, roughness) that can be re-lit in any environment. The tradeoff: O-Voxel requires PBR-annotated training data, while NeRF/3DGS can learn from posed images alone.

Relation to DC-AE (Deep Compression Autoencoder)

DC-AE demonstrated that residual autoencoding enables extreme spatial compression in 2D image VAEs. TRELLIS.2 adapts this principle to sparse 3D voxels — the space-to-channel rearrangement and non-parametric shortcuts are directly inherited. The key extension: handling sparsity (most voxels are empty, children may not exist).

Relation to Flow Matching / Rectified Flow

The generation backbone uses flow matching rather than DDPM-style diffusion. Flow matching learns straight-line transport from noise to data, enabling fewer sampling steps. The DiT architecture follows the standard recipe (AdaLN, RoPE), with the novelty being its application to structured sparse 3D latents rather than 2D image tokens.

Cheat Sheet

| Aspect | TRELLIS.2 |
|---|---|
| Representation | O-Voxel (field-free sparse voxels with geometry + PBR) |
| Compression | SC-VAE, 16× spatial downsample, ~9.6K tokens @ 1024³ |
| Generator | 3 DiT models (~4B total), flow matching |
| Pipeline | Structure → Shape → Material (sequential) |
| Input | Single image (DINOv3-L features) |
| Output | Mesh + PBR textures (base color, metallic, roughness, alpha) |
| Speed (1024³) | ~17s on H100 (~10s shape + ~7s texture) |
| Speed (512³) | ~3s on H100 |
| Training data | ~800K 3D assets (Objaverse-XL, ABO, HSSD, TexVerse) |
| Training compute | 32 H100 GPUs, batch 256, AdamW lr=1e-4 |
| Key ablation | w/o Residual AE: MD +69%, PSNR -0.5dB |
| vs TRELLIS | 16× vs 4× compression, native 3D vs multiview, PBR vs baked |
The broader lesson: Representation is everything. A better 3D representation (O-Voxel) enables better compression (16× SC-VAE), which enables more efficient generation (fewer tokens for DiT to process), which enables higher resolution (1536³) and higher quality (PBR materials). Every downstream improvement traces back to getting the representation right.
What core 2D technique did TRELLIS.2 adapt for its sparse 3D compression?