Diffusers
Safetensors
English
DifixPipeline

Adding the Model Card and usage scripts — v0

#1
by alexg33 - opened
Files changed (1)
  1. README.md +170 -0
README.md ADDED
@@ -0,0 +1,170 @@
+ # NVIDIA Fixer Model: Difix3D+
+
+ [Project Page](https://research.nvidia.com/labs/toronto-ai/difix3d/) | [Code](https://github.com/nv-tlabs/Difix3D) | [Paper](https://arxiv.org/abs/2503.01774)
+
+ Fixer is a single-step image diffusion model trained to enhance rendered novel views and remove the artifacts caused by
+ underconstrained regions of the 3D representation. The technology behind Fixer is based on the concepts outlined in the paper
+ [DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models](https://arxiv.org/abs/2503.01774).
+
+ Fixer has two operation modes:
+
+ * Offline mode: Used during the reconstruction phase to clean up pseudo-training views rendered from the reconstruction, which
+ are then distilled back into 3D. This greatly enhances underconstrained regions and improves the overall quality of the 3D representation (see the sketch below).
+ * Online mode: Acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D
+ supervision and the limited capacity of current reconstruction models.
+
+ Fixer is an all-encompassing solution: a single model compatible with both NeRF and 3DGS representations.
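+
+ To make the offline mode concrete, the snippet below is a minimal, hypothetical sketch of that render-fix-distill loop. The names `render_view`, `fix_image`, and `distill` are illustrative stand-ins, not APIs from the Difix3D repository:
+
+ ```python
+ # Hypothetical sketch of the offline mode: render pseudo-training views,
+ # clean them with the fixer, and distill them back into the 3D representation.
+ def offline_fixer_loop(reconstruction, novel_poses, fix_image, num_rounds=3):
+     for _ in range(num_rounds):
+         for pose in novel_poses:
+             rendered = reconstruction.render_view(pose)  # artifact-prone novel view
+             fixed = fix_image(rendered, prompt="remove degradation")
+             reconstruction.distill(pose, fixed)  # fixed view becomes pseudo ground truth
+     return reconstruction
+ ```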
+
+ This model is ready for commercial use.
+
+ **License/Terms of Use:** Your use of the software container is governed by the NVIDIA Software License Agreement and Product-Specific Terms for NVIDIA AI Products. Your use of the model is governed by the NVIDIA Open Model License Agreement.
+ **Deployment Geography:** Global
+ **Use Case:** Fixer is intended for autonomous vehicle developers to enhance and improve their neural reconstruction pipelines. The model takes an image as input and outputs a fixed image.
+ **Release Date:** V1: June 2025
+
+
+ ### Model Sources
+
+ - **Repository:** [GitHub](https://github.com/nv-tlabs/Difix3D)
+ - **Paper:** [DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models](https://arxiv.org/abs/2503.01774)
+
+ ## Use the Fixer Model
+
+ 1. **Set up your environment.** Run the following command to set up your environment:
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+ 2. **Prepare your data.** Use the following JSON format for your data (a sketch for generating this file follows after this list):
+
+ ```json
+ {
+     "train": {
+         "{data_id}": {
+             "image": "{PATH_TO_IMAGE}",
+             "target_image": "{PATH_TO_TARGET_IMAGE}",
+             "ref_image": "{PATH_TO_REF_IMAGE}",
+             "prompt": "remove degradation"
+         }
+     },
+     "test": {
+         "{data_id}": {
+             "image": "{PATH_TO_IMAGE}",
+             "target_image": "{PATH_TO_TARGET_IMAGE}",
+             "ref_image": "{PATH_TO_REF_IMAGE}",
+             "prompt": "remove degradation"
+         }
+     }
+ }
+ ```
+ 3. **Train the model.** Modify the following scripts for your use case:
+
+ **Single GPU**
+
+ ```bash
+ accelerate launch --mixed_precision=bf16 src/train_difix.py \
+     --output_dir=./outputs/difix/train \
+     --dataset_path="data/data.json" \
+     --max_train_steps 100000 \
+     --resolution=512 --learning_rate 2e-5 \
+     --train_batch_size=1 --dataloader_num_workers 8 \
+     --enable_xformers_memory_efficient_attention \
+     --checkpointing_steps=1000 --eval_freq 1000 --viz_freq 100 \
+     --lambda_lpips 1.0 --lambda_l2 1.0 --lambda_gram 1.0 --gram_loss_warmup_steps 2000 \
+     --report_to "wandb" --tracker_project_name "difix" --tracker_run_name "train" --timestep 199
+ ```
+
+ **Multiple GPUs**
+
+ ```bash
+ export NUM_NODES=1
+ export NUM_GPUS=8
+ accelerate launch --mixed_precision=bf16 --main_process_port 29501 --multi_gpu --num_machines $NUM_NODES --num_processes $NUM_GPUS src/train_difix.py \
+     --output_dir=./outputs/difix/train \
+     --dataset_path="data/data.json" \
+     --max_train_steps 100000 \
+     --resolution=512 --learning_rate 2e-5 \
+     --train_batch_size=1 --dataloader_num_workers 8 \
+     --enable_xformers_memory_efficient_attention \
+     --checkpointing_steps=1000 --eval_freq 1000 --viz_freq 100 \
+     --lambda_lpips 1.0 --lambda_l2 1.0 --lambda_gram 1.0 --gram_loss_warmup_steps 2000 \
+     --report_to "wandb" --tracker_project_name "difix" --tracker_run_name "train" --timestep 199
+ ```
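+
+ As referenced in step 2, the `data.json` file can also be generated programmatically. The sketch below assumes a hypothetical directory layout (`inputs/`, `targets/`, `refs/` holding matching files); only the JSON schema itself comes from this card:
+
+ ```python
+ # Build data.json in the schema shown in step 2.
+ # Assumed (hypothetical) layout: inputs/, targets/, refs/ hold matching *.png files.
+ import json
+ from pathlib import Path
+
+ def build_entries(root: Path) -> dict:
+     entries = {}
+     for image in sorted((root / "inputs").glob("*.png")):
+         entries[image.stem] = {
+             "image": str(image),
+             "target_image": str(root / "targets" / image.name),
+             "ref_image": str(root / "refs" / image.name),
+             "prompt": "remove degradation",
+         }
+     return entries
+
+ data = {"train": build_entries(Path("data/train")), "test": build_entries(Path("data/test"))}
+ Path("data/data.json").write_text(json.dumps(data, indent=2))
+ ```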
+
+ ## Inference
+
+ Download the pre-trained model from [Google Drive](https://drive.google.com/file/d/1BJPTf02yABE6wneRkndudg87ECZ2oAcS/view?usp=sharing) and place it in the `checkpoints` directory.
+
+ ```bash
+ python src/inference_difix.py \
+     --model_path "checkpoints/model.pkl" \
+     --input_image "assets/example_input.png" \
+     --prompt "remove degradation" \
+     --output_dir "outputs/inference" \
+     --timestep 199
+ ```
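+
+ Since this card is tagged with a Diffusers `DifixPipeline`, inference should also be possible directly from Python. The snippet below is a sketch assuming the pipeline follows the standard Diffusers image-to-image call signature; check the repository for the exact interface:
+
+ ```python
+ # Sketch only: assumes DifixPipeline exposes the usual Diffusers interface.
+ from diffusers import DiffusionPipeline
+ from diffusers.utils import load_image
+
+ pipe = DiffusionPipeline.from_pretrained("nvidia/difix", trust_remote_code=True)
+ pipe.to("cuda")
+
+ input_image = load_image("assets/example_input.png")
+ # Single denoising step at timestep 199, mirroring the CLI flags above.
+ result = pipe("remove degradation", image=input_image,
+               num_inference_steps=1, timesteps=[199], guidance_scale=0.0)
+ result.images[0].save("example_output.png")
+ ```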
+
+ **Engine:** PyTorch >= 2.0.0
+ **Test Hardware:**
+ We tested on H100 and A100 GPUs. Inference time on 1x A100 (32-bit) is 0.355 seconds.
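+
+ Figures like the 0.355-second latency above can be reproduced with a CUDA-synchronized timer; this is a generic measurement sketch, not the script behind the reported number:
+
+ ```python
+ # Generic latency measurement: warm up, then average over synchronized runs.
+ import time
+ import torch
+
+ def time_inference(run_once, warmup=3, iters=10):
+     for _ in range(warmup):
+         run_once()
+     torch.cuda.synchronize()
+     start = time.perf_counter()
+     for _ in range(iters):
+         run_once()
+     torch.cuda.synchronize()
+     return (time.perf_counter() - start) / iters
+
+ # Example: time_inference(lambda: pipe("remove degradation", image=input_image))
+ ```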
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{wu2025difix3d,
+     title={Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models},
+     author={Wu, Jay Zhangjie and Zhang, Yuxuan and Turki, Haithem and Ren, Xuanchi and Gao, Jun and Shou, Mike Zheng and Fidler, Sanja and Gojcic, Zan and Ling, Huan},
+     booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+     year={2025}
+ }
+ ```
+
+ ## Fixer Model Details
+
+ **Network Architecture:** Linear-attention Diffusion Transformer with a Deep Compression Autoencoder (DC-AE) for efficient high-resolution image generation.
+
+ ### Input
+
+ **Input Type(s):** Image
+ **Input Format(s):** Red, Green, Blue (RGB)
+ **Input Parameters:** Two-Dimensional (2D)
+ **Other Properties Related to Input:** Specific resolution: 576 px x 1024 px
+
+ ### Output
+
+ **Output Type(s):** Image
+ **Output Format(s):** Red, Green, Blue (RGB)
+ **Output Parameters:** Two-Dimensional (2D)
+ **Other Properties Related to Output:** Specific resolution: 576 px x 1024 px
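+
+ Because the model expects a fixed resolution, arbitrary inputs generally need resizing first. A minimal sketch, assuming the listed 576 px x 1024 px means height x width (typical for wide driving footage):
+
+ ```python
+ # Resize an arbitrary input to the model's expected resolution.
+ from PIL import Image
+
+ HEIGHT, WIDTH = 576, 1024  # assumption: card lists 576 px x 1024 px as height x width
+
+ img = Image.open("assets/example_input.png").convert("RGB")
+ img = img.resize((WIDTH, HEIGHT), Image.LANCZOS)  # PIL expects (width, height)
+ img.save("example_input_resized.png")
+ ```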
+
+ ### Software Integration:
+ **Runtime Engine:** PyTorch
+ **Supported Hardware Microarchitecture Compatibility:** NVIDIA Ampere
+ **Supported Operating System(s):** Linux
+
+ ## Model Version(s):
+ SanaMS_1600M_P1_D20
+
+
+ ## Training, Testing, and Evaluation Datasets:
+
+ Fixer was trained, tested, and evaluated using two datasets, where 80% of the data was used for training, 10% for evaluation, and 10% for testing.
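+
+ An 80/10/10 split like the one described can be reproduced with a seeded shuffle; this is a generic sketch, not NVIDIA's actual split procedure:
+
+ ```python
+ # Deterministic 80/10/10 train/eval/test split over a list of sample IDs.
+ import random
+
+ def split_80_10_10(ids, seed=0):
+     ids = sorted(ids)
+     random.Random(seed).shuffle(ids)
+     n_train, n_eval = int(0.8 * len(ids)), int(0.1 * len(ids))
+     return ids[:n_train], ids[n_train:n_train + n_eval], ids[n_train + n_eval:]
+ ```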
+
+ ### Fixer Dataset
+
+ **Data Collection Method:** Human
+ **Labeling Method by Dataset:** Human
+ **Properties:** The dataset is collected by the NVIDIA Sana team. It contains only AI-generated images, obtained directly from the public sections of the official websites ideogram.ai/t/explore and midjourney.com/explore. Note that we do not use the Ideogram or Midjourney APIs to generate these images; instead, we source images that platform users have created and uploaded to the public sections. For instance, if users choose to make their generated images public, they can upload them to the “explore” section on ideogram.ai, and we subsequently use these images for training.
+
+ ### NVIDIA Internal AV Dataset
+
+ **Data Collection Method:** Sensors
+ **Labeling Method by Dataset:** Human
+ **Properties:** The dataset contains autonomous driving images and videos captured by NVIDIA autonomous driving vehicles.
+
+ ## Ethical Considerations:
+ NVIDIA believes Trustworthy AI is a shared responsibility. We have established policies and practices to enable development for a wide
+ array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model
+ team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+
+ Please report security vulnerabilities or NVIDIA AI concerns [through our Product Security team](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).