Diffusers
Safetensors
English
DifixPipeline

Adding the Model Card and usage scripts — v0

#1
by alexg33 - opened
Files changed (1)
  1. README.md +170 -0
README.md ADDED
@@ -0,0 +1,170 @@
+ # NVIDIA Fixer Model: Difix3D+
+
+ [Project Page](https://research.nvidia.com/labs/toronto-ai/difix3d/) | [Code](https://github.com/nv-tlabs/Difix3D) | [Paper](https://arxiv.org/abs/2503.01774)
+
+ Fixer is a single-step image diffusion model trained to enhance rendered novel views and remove the artifacts caused by
+ underconstrained regions of the 3D representation. The technology behind Fixer is based on the concepts outlined in the paper
+ [DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models](https://arxiv.org/abs/2503.01774).
+
+ Fixer has two operation modes:
+
+ * Offline mode: Used during the reconstruction phase to clean up pseudo-training views rendered from the reconstruction, which
+ are then distilled back into 3D. This greatly enhances underconstrained regions and improves the overall quality of the 3D representation (see the sketch below).
+ * Online mode: Acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D
+ supervision and the limited capacity of current reconstruction models.
+
+ Fixer is an all-encompassing solution: a single model compatible with both NeRF and 3DGS representations.
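+
+ To make the offline mode concrete, the snippet below is a minimal, hypothetical sketch of that render-fix-distill loop. The names `render_view`, `fix_image`, and `distill` are illustrative stand-ins, not APIs from the Difix3D repository:
+
+ ```python
+ # Hypothetical sketch of the offline mode: render pseudo-training views,
+ # clean them with the fixer, and distill them back into the 3D representation.
+ def offline_fixer_loop(reconstruction, novel_poses, fix_image, num_rounds=3):
+     for _ in range(num_rounds):
+         for pose in novel_poses:
+             rendered = reconstruction.render_view(pose)  # artifact-prone novel view
+             fixed = fix_image(rendered, prompt="remove degradation")
+             reconstruction.distill(pose, fixed)  # fixed view becomes pseudo ground truth
+     return reconstruction
+ ```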
+
+ This model is ready for commercial use.
+
+ **License/Terms of Use:** Your use of the software container is governed by the NVIDIA Software License Agreement and Product-Specific Terms for NVIDIA AI Products. Your use of the model is governed by the NVIDIA Open Model License Agreement.
+ **Deployment Geography:** Global
+ **Use Case:** Fixer is intended for autonomous vehicle developers to enhance and improve their neural reconstruction pipelines. The model takes an image as input and outputs a fixed image.
+ **Release Date:** V1: June 2025
+
+
+ ### Model Sources
+
+ - **Repository:** [GitHub](https://github.com/nv-tlabs/Difix3D)
+ - **Paper:** [DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models](https://arxiv.org/abs/2503.01774)
+
+ ## Use the Fixer Model
+
+ 1. **Set up your environment.** Run the following command to set up your environment:
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+ 2. **Prepare your data.** Use the following JSON format for your data (a sketch for generating this file follows after this list):
+
+ ```json
+ {
+     "train": {
+         "{data_id}": {
+             "image": "{PATH_TO_IMAGE}",
+             "target_image": "{PATH_TO_TARGET_IMAGE}",
+             "ref_image": "{PATH_TO_REF_IMAGE}",
+             "prompt": "remove degradation"
+         }
+     },
+     "test": {
+         "{data_id}": {
+             "image": "{PATH_TO_IMAGE}",
+             "target_image": "{PATH_TO_TARGET_IMAGE}",
+             "ref_image": "{PATH_TO_REF_IMAGE}",
+             "prompt": "remove degradation"
+         }
+     }
+ }
+ ```
+ 3. **Train the model.** Modify the following scripts for your use case:
+
+ **Single GPU**
+
+ ```bash
+ accelerate launch --mixed_precision=bf16 src/train_difix.py \
+     --output_dir=./outputs/difix/train \
+     --dataset_path="data/data.json" \
+     --max_train_steps 100000 \
+     --resolution=512 --learning_rate 2e-5 \
+     --train_batch_size=1 --dataloader_num_workers 8 \
+     --enable_xformers_memory_efficient_attention \
+     --checkpointing_steps=1000 --eval_freq 1000 --viz_freq 100 \
+     --lambda_lpips 1.0 --lambda_l2 1.0 --lambda_gram 1.0 --gram_loss_warmup_steps 2000 \
+     --report_to "wandb" --tracker_project_name "difix" --tracker_run_name "train" --timestep 199
+ ```
+
+ **Multiple GPUs**
+
+ ```bash
+ export NUM_NODES=1
+ export NUM_GPUS=8
+ accelerate launch --mixed_precision=bf16 --main_process_port 29501 --multi_gpu --num_machines $NUM_NODES --num_processes $NUM_GPUS src/train_difix.py \
+     --output_dir=./outputs/difix/train \
+     --dataset_path="data/data.json" \
+     --max_train_steps 100000 \
+     --resolution=512 --learning_rate 2e-5 \
+     --train_batch_size=1 --dataloader_num_workers 8 \
+     --enable_xformers_memory_efficient_attention \
+     --checkpointing_steps=1000 --eval_freq 1000 --viz_freq 100 \
+     --lambda_lpips 1.0 --lambda_l2 1.0 --lambda_gram 1.0 --gram_loss_warmup_steps 2000 \
+     --report_to "wandb" --tracker_project_name "difix" --tracker_run_name "train" --timestep 199
+ ```
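+
+ As referenced in step 2, the `data.json` file can also be generated programmatically. The sketch below assumes a hypothetical directory layout (`inputs/`, `targets/`, `refs/` holding matching files); only the JSON schema itself comes from this card:
+
+ ```python
+ # Build data.json in the schema shown in step 2.
+ # Assumed (hypothetical) layout: inputs/, targets/, refs/ hold matching *.png files.
+ import json
+ from pathlib import Path
+
+ def build_entries(root: Path) -> dict:
+     entries = {}
+     for image in sorted((root / "inputs").glob("*.png")):
+         entries[image.stem] = {
+             "image": str(image),
+             "target_image": str(root / "targets" / image.name),
+             "ref_image": str(root / "refs" / image.name),
+             "prompt": "remove degradation",
+         }
+     return entries
+
+ data = {"train": build_entries(Path("data/train")), "test": build_entries(Path("data/test"))}
+ Path("data/data.json").write_text(json.dumps(data, indent=2))
+ ```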
+
+ ## Inference
+
+ Download the pre-trained model from [Google Drive](https://drive.google.com/file/d/1BJPTf02yABE6wneRkndudg87ECZ2oAcS/view?usp=sharing) and place it in the `checkpoints` directory.
+
+ ```bash
+ python src/inference_difix.py \
+     --model_path "checkpoints/model.pkl" \
+     --input_image "assets/example_input.png" \
+     --prompt "remove degradation" \
+     --output_dir "outputs/inference" \
+     --timestep 199
+ ```
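+
+ Since this card is tagged with a Diffusers `DifixPipeline`, inference should also be possible directly from Python. The snippet below is a sketch assuming the pipeline follows the standard Diffusers image-to-image call signature; check the repository for the exact interface:
+
+ ```python
+ # Sketch only: assumes DifixPipeline exposes the usual Diffusers interface.
+ from diffusers import DiffusionPipeline
+ from diffusers.utils import load_image
+
+ pipe = DiffusionPipeline.from_pretrained("nvidia/difix", trust_remote_code=True)
+ pipe.to("cuda")
+
+ input_image = load_image("assets/example_input.png")
+ # Single denoising step at timestep 199, mirroring the CLI flags above.
+ result = pipe("remove degradation", image=input_image,
+               num_inference_steps=1, timesteps=[199], guidance_scale=0.0)
+ result.images[0].save("example_output.png")
+ ```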
+
+ **Engine:** PyTorch >= 2.0.0
+ **Test Hardware:**
+ We tested on H100 and A100 GPUs. Inference time on 1x A100 (32-bit) is 0.355 seconds.
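+
+ Figures like the 0.355-second latency above can be reproduced with a CUDA-synchronized timer; this is a generic measurement sketch, not the script behind the reported number:
+
+ ```python
+ # Generic latency measurement: warm up, then average over synchronized runs.
+ import time
+ import torch
+
+ def time_inference(run_once, warmup=3, iters=10):
+     for _ in range(warmup):
+         run_once()
+     torch.cuda.synchronize()
+     start = time.perf_counter()
+     for _ in range(iters):
+         run_once()
+     torch.cuda.synchronize()
+     return (time.perf_counter() - start) / iters
+
+ # Example: time_inference(lambda: pipe("remove degradation", image=input_image))
+ ```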
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{wu2025difix3d,
+     title={Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models},
+     author={Wu, Jay Zhangjie and Zhang, Yuxuan and Turki, Haithem and Ren, Xuanchi and Gao, Jun and Shou, Mike Zheng and Fidler, Sanja and Gojcic, Zan and Ling, Huan},
+     booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+     year={2025}
+ }
+ ```
+
+ ## Fixer Model Details
+
+ **Network Architecture:** Linear-attention Diffusion Transformer with a Deep Compression Autoencoder (DC-AE) for efficient high-resolution image generation.
+
+ ### Input
+
+ **Input Type(s):** Image
+ **Input Format(s):** Red, Green, Blue (RGB)
+ **Input Parameters:** Two-Dimensional (2D)
+ **Other Properties Related to Input:** Specific resolution: 576 px x 1024 px
+
+ ### Output
+
+ **Output Type(s):** Image
+ **Output Format(s):** Red, Green, Blue (RGB)
+ **Output Parameters:** Two-Dimensional (2D)
+ **Other Properties Related to Output:** Specific resolution: 576 px x 1024 px
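+
+ Because the model expects a fixed resolution, arbitrary inputs generally need resizing first. A minimal sketch, assuming the listed 576 px x 1024 px means height x width (typical for wide driving footage):
+
+ ```python
+ # Resize an arbitrary input to the model's expected resolution.
+ from PIL import Image
+
+ HEIGHT, WIDTH = 576, 1024  # assumption: card lists 576 px x 1024 px as height x width
+
+ img = Image.open("assets/example_input.png").convert("RGB")
+ img = img.resize((WIDTH, HEIGHT), Image.LANCZOS)  # PIL expects (width, height)
+ img.save("example_input_resized.png")
+ ```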
+
+ ### Software Integration:
+ **Runtime Engine:** PyTorch
+ **Supported Hardware Microarchitecture Compatibility:** NVIDIA Ampere
+ **Supported Operating System(s):** Linux
+
+ ## Model Version(s):
+ SanaMS_1600M_P1_D20
+
+
+ ## Training, Testing, and Evaluation Datasets:
+
+ Fixer was trained, tested, and evaluated using two datasets, where 80% of the data was used for training, 10% for evaluation, and 10% for testing.
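+
+ An 80/10/10 split like the one described can be reproduced with a seeded shuffle; this is a generic sketch, not NVIDIA's actual split procedure:
+
+ ```python
+ # Deterministic 80/10/10 train/eval/test split over a list of sample IDs.
+ import random
+
+ def split_80_10_10(ids, seed=0):
+     ids = sorted(ids)
+     random.Random(seed).shuffle(ids)
+     n_train, n_eval = int(0.8 * len(ids)), int(0.1 * len(ids))
+     return ids[:n_train], ids[n_train:n_train + n_eval], ids[n_train + n_eval:]
+ ```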
+
+ ### Fixer Dataset
+
+ **Data Collection Method:** Human
+ **Labeling Method by Dataset:** Human
+ **Properties:** The dataset is collected by the NVIDIA Sana team. It contains only AI-generated images, obtained directly from the public sections of the official websites ideogram.ai/t/explore and midjourney.com/explore. Note that we do not use the Ideogram or Midjourney APIs to generate these images; instead, we source images that platform users have created and uploaded to the public sections. For instance, if users choose to make their generated images public, they can upload them to the “explore” section on ideogram.ai, and we subsequently use these images for training.
+
+ ### NVIDIA Internal AV Dataset
+
+ **Data Collection Method:** Sensors
+ **Labeling Method by Dataset:** Human
+ **Properties:** The dataset contains autonomous driving images and videos captured by NVIDIA autonomous driving vehicles.
+
+ ## Ethical Considerations:
+ NVIDIA believes Trustworthy AI is a shared responsibility. We have established policies and practices to enable development for a wide
+ array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model
+ team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+
+ Please report security vulnerabilities or NVIDIA AI concerns [through our Product Security team](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).