TL;DR: Realiz3D is a framework for training controllable, realistic diffusion models from annotated synthetic data and unlabeled real data.
Realiz3D is a framework that leverages both real and synthetic data to train diffusion models that generate photorealistic images while faithfully adhering to input conditions and maintaining 3D consistency. Compared to standard fine-tuning on mixed real and synthetic data, Realiz3D produces noticeably more realistic results while preserving geometric fidelity across views.
We often aim to generate images that are both photorealistic and 3D-consistent, adhering to precise geometry, material, and viewpoint controls. Typically, this is achieved by fine-tuning an image generator, pre-trained on billions of real images, on renders of synthetic 3D assets, for which annotations of the control signals are available. While this approach can learn the desired controls, it often compromises the realism of the images due to the domain gap between photographs and renders. We observe that this issue largely arises because the model learns an unintended association between the presence of control signals and the synthetic appearance of the images. To address this, we introduce Realiz3D, a lightweight framework for training diffusion models that decouples controls from the visual domain. The key idea is to explicitly learn the visual domain, real or synthetic, separately from the other control signals by introducing a co-variate that, fed into small residual adapters, shifts the domain. The generator can then be trained to gain controllability without fitting to a specific visual domain. In this way, the model can be guided to produce realistic images even when controls are applied. We further enhance control transferability to the real domain by leveraging insights about the roles of different layers and denoising steps in diffusion-based generators, which inform new training and inference strategies that mitigate the remaining gap. We demonstrate the advantages of Realiz3D on tasks such as text-to-multiview generation and texturing from 3D inputs, producing outputs that are 3D-consistent and photorealistic.
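To make the residual-adapter idea concrete, here is a minimal PyTorch sketch of a Domain Shifter as described above: a low-rank residual, conditioned on a domain co-variate, added to latent features. The class name, rank, and embedding-based conditioning are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of a Domain Shifter: a low-rank residual, conditioned on a
# domain label, added to latent features. Hyperparameters and the
# embedding-based conditioning below are assumptions, not the paper's code.
import torch
import torch.nn as nn

class DomainShifter(nn.Module):
    def __init__(self, dim, rank=4, num_domains=2):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # project to low rank
        self.up = nn.Linear(rank, dim, bias=False)     # project back to dim
        nn.init.zeros_(self.up.weight)                 # start as a zero residual
        # One learned code per visual domain (e.g., 0 = synthetic, 1 = real).
        self.domain_embed = nn.Embedding(num_domains, rank)

    def forward(self, x, domain):
        # x: (B, N, dim) latent features; domain: (B,) integer domain labels
        h = self.down(x) + self.domain_embed(domain).unsqueeze(1)
        return x + self.up(h)  # domain-dependent low-rank shift

# Toy usage: shift the same features into synthetic vs. real mode.
shifter = DomainShifter(dim=64)
x = torch.randn(2, 16, 64)
y = shifter(x, torch.tensor([0, 1]))  # per-sample domain labels
```

Zero-initializing the up-projection makes the adapter start as an identity mapping, so the pre-trained generator is unchanged before training begins.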
Realiz3D introduces Domain Shifters, lightweight residual adapters that learn visual domain identity (real vs. synthetic) independently of control signals, enabling the model to learn controllability without compromising realism.
(Top left) A Domain Shifter encodes domain identity as a low-rank residual added to latent features. (Top right) Stage 1: Domain Shifters are trained with real and synthetic data, learning domain separation. (Bottom) Stage 2: The diffusion model is fine-tuned for controllable generation using both domains. (Bottom left) Synthetic samples teach controllability under the synthetic mode. (Bottom right) Real samples are used for Representation Binding, combining (1) Layer-Aware Training, freezing early layers while updating later ones, and (2) Domain Reassignment, occasionally reusing the synthetic mode in early layers to transfer control to the real domain.
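The sketch below wires the two Stage 2 strategies for real samples, Layer-Aware Training and Domain Reassignment, into a toy denoiser that reuses the DomainShifter class sketched earlier. The block split, the reassignment probability, and the plain noise-prediction loss are our assumptions about how these pieces could fit together, not the paper's actual training code.

```python
# Illustrative Stage 2 step on a real-domain batch: early layers are frozen
# (Layer-Aware Training) and occasionally run in the synthetic mode
# (Domain Reassignment). ToyDenoiser is a stand-in for a real diffusion
# backbone; it reuses the DomainShifter class from the sketch above.
import random
import torch
import torch.nn as nn

SYNTHETIC, REAL = 0, 1

class ToyDenoiser(nn.Module):
    def __init__(self, dim=64, n_early=2, n_late=2):
        super().__init__()
        self.early_blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_early))
        self.late_blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_late))
        self.early_shifters = nn.ModuleList(DomainShifter(dim) for _ in range(n_early))
        self.late_shifters = nn.ModuleList(DomainShifter(dim) for _ in range(n_late))

    def forward(self, x, early_dom, late_dom):
        # Early and late blocks may run under different domain labels.
        for blk, sh in zip(self.early_blocks, self.early_shifters):
            x = sh(torch.relu(blk(x)), early_dom)
        for blk, sh in zip(self.late_blocks, self.late_shifters):
            x = sh(torch.relu(blk(x)), late_dom)
        return x

def real_domain_step(model, x_t, target, p_reassign=0.5):
    B = x_t.shape[0]
    # Layer-Aware Training: early layers stay frozen on real samples.
    for p in model.early_blocks.parameters():
        p.requires_grad_(False)
    # Domain Reassignment: with probability p_reassign, run the early layers
    # in the synthetic mode so control learned on renders transfers to real.
    early = SYNTHETIC if random.random() < p_reassign else REAL
    early_dom = torch.full((B,), early, dtype=torch.long)
    late_dom = torch.full((B,), REAL, dtype=torch.long)
    pred = model(x_t, early_dom, late_dom)
    return nn.functional.mse_loss(pred, target)

model = ToyDenoiser()
x_t = torch.randn(4, 16, 64)                      # toy noisy latent tokens
loss = real_domain_step(model, x_t, torch.zeros_like(x_t))
loss.backward()
```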
We perform text-to-3D generation via multiview texturing, backprojecting the generated textures onto their corresponding original meshes. Occluded regions are naïvely filled and may appear blurry.
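For intuition, here is a minimal NumPy sketch of depth-tested backprojection: each vertex takes its color from the first view in which it is visible, with per-view depth maps used for the occlusion test. The function, its arguments, and the mean-color fill for occluded vertices are illustrative assumptions, not the pipeline's actual implementation.

```python
# Hedged sketch of texture backprojection with a depth-based visibility test.
import numpy as np

def backproject_colors(verts, images, K, w2c, depths, z_eps=1e-2):
    """Assign each vertex the color of the first view where it is visible.

    verts:  (V, 3) mesh vertices in world space
    images: list of (H, W, 3) uint8 generated views
    K:      (3, 3) shared pinhole intrinsics
    w2c:    list of (4, 4) world-to-camera extrinsics
    depths: list of (H, W) depth maps rendered from the mesh (occlusion test)
    """
    V = len(verts)
    colors = np.zeros((V, 3))
    filled = np.zeros(V, dtype=bool)
    verts_h = np.concatenate([verts, np.ones((V, 1))], axis=1)  # homogeneous

    for img, T, depth in zip(images, w2c, depths):
        cam = (T @ verts_h.T).T[:, :3]        # world -> camera coordinates
        z = cam[:, 2]
        front = z > 1e-6                      # only points in front of the camera
        pix = (K @ cam.T).T                   # unnormalized pixel coordinates
        u = np.full(V, -1)
        v = np.full(V, -1)
        u[front] = np.round(pix[front, 0] / z[front]).astype(int)
        v[front] = np.round(pix[front, 1] / z[front]).astype(int)
        H, W = depth.shape
        in_img = front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        # Visible iff the vertex depth matches the rendered depth at its pixel.
        vis = np.zeros(V, dtype=bool)
        vis[in_img] = np.abs(z[in_img] - depth[v[in_img], u[in_img]]) < z_eps
        take = vis & ~filled
        colors[take] = img[v[take], u[take]] / 255.0
        filled |= take

    # Naïve fill for never-seen (occluded) vertices: mean of assigned colors.
    if filled.any():
        colors[~filled] = colors[filled].mean(axis=0)
    return colors
```

In practice a view-blending scheme (e.g., weighting by viewing angle) would typically replace the first-visible-view rule used here for brevity.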
For comparison, we also present the results of standard fine-tuning: directly fine-tuning the same base model using the same mix of real and synthetic data. See additional baselines in the paper.
@inproceedings{sobol2026realiz3d,
  author    = {Ido Sobol and Kihyuk Sohn and Yoav Blum and Egor Zakharov and Max Bluvstein and Andrea Vedaldi and Or Litany},
  booktitle = {Proceedings of the {IEEE}/CVF Conference on Computer Vision and Pattern Recognition ({CVPR})},
  title     = {{Realiz3D}: {3D} Generation Made Photorealistic via Domain-Aware Learning},
  year      = {2026}
}