Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
Paper
This model was presented in the paper Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation.
Abstract
Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance.
Links
- Project Page: https://kangliao929.github.io/projects/puffin
- GitHub Repository: https://github.com/KangLiao929/Puffin
- Hugging Face Space: https://huggingface.co/spaces/KangLiao/Puffin
- Hugging Face Dataset: https://huggingface.co/datasets/KangLiao/Puffin-4M
Model Details
Puffin is a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. It learns camera-centric understanding and generation tasks in a unified multimodal framework. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context.
| Developed by | Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy |
| Affiliation | S-Lab, Nanyang Technological University |
| First released | arXiv pre-print, 2025 |
| Model type | Unified multimodal models (diffusion / autoregressive modelling with camera-centric understanding and generation) |
| Modality | Image → Text+Camera; Text+Camera → Image; Image+Camera → Image; Image+Camera → Text |
Direct Use
- Camera-centric understanding and generation from a single image or a text-camera pair, with support for the thinking mode.
- World exploration: performs cross-view generation from a given initial view and target camera configuration.
- Spatial imagination: generates a scene description of an unseen view based on an initial view and target camera configuration.
- 3D virtual object insertion in AR/VR: assists the insertion of virtual 3D objects into in-the-wild images by calibrating camera parameters.
Sample Usage
This section demonstrates how to generate images with camera control using Puffin-Base, based on the examples provided in the GitHub repository.
First, download the model checkpoints from 🤗 KangLiao/Puffin and organize them in a checkpoints directory, for example:
Puffin/
└── checkpoints
    ├── Puffin-Align.pth      # provided for customized SFT
    ├── Puffin-Base.pth
    ├── Puffin-Thinking.pth
    └── Puffin-Instruct.pth
You can use huggingface-cli to download the checkpoints:
# pip install -U "huggingface_hub[cli]"
huggingface-cli download KangLiao/Puffin --local-dir checkpoints --repo-type model
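Alternatively, a minimal Python sketch using the huggingface_hub API (this assumes the same checkpoints layout as above):
# pip install -U huggingface_hub
from huggingface_hub import snapshot_download

# Download all checkpoint files from the Puffin model repo into ./checkpoints
snapshot_download(repo_id="KangLiao/Puffin", repo_type="model", local_dir="checkpoints")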
To run the camera-controllable image generation:
export PYTHONPATH=./:$PYTHONPATH
python scripts/demo/generation.py configs/pipelines/stage_2_base.py \
--checkpoint checkpoints/Puffin-Base.pth --output generation_result.jpg \
--prompt "A streetlamp casts light on an outdoor mural with intricate floral designs and text, set against a building wall." \
-r -0.3939 -p 0.0277 -f 0.7595
This command generates an image from the provided text prompt and camera parameters: roll (-r), pitch (-p), and vertical field of view (-f), all in radians. The output image will be saved as generation_result.jpg.
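If your camera parameters are in degrees, they can be converted to the radians expected by these flags with Python's standard library; the degree values below are illustrative and approximately reproduce the example above:
python -c "import math; print(math.radians(-22.57), math.radians(1.59), math.radians(43.5))"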
To enable the thinking mode for image generation, switch to the thinking config and checkpoint, and append the --thinking flag:
python scripts/demo/generation.py configs/pipelines/stage_3_thinking.py \
--checkpoint checkpoints/Puffin-Thinking.pth --output generation_result_thinking.jpg \
--prompt "A streetlamp casts light on an outdoor mural with intricate floral designs and text, set against a building wall." \
-r -0.3939 -p 0.0277 -f 0.7595 \
--thinking
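To sweep several camera configurations with the same prompt, a simple shell loop over the flags shown above can be used (a sketch only; the roll values here are arbitrary):
export PYTHONPATH=./:$PYTHONPATH
# Generate one image per roll value, keeping pitch and vertical FoV fixed
for roll in -0.4 -0.2 0.0 0.2 0.4; do
  python scripts/demo/generation.py configs/pipelines/stage_2_base.py \
    --checkpoint checkpoints/Puffin-Base.pth --output "generation_roll_${roll}.jpg" \
    --prompt "A streetlamp casts light on an outdoor mural with intricate floral designs and text, set against a building wall." \
    -r ${roll} -p 0.0277 -f 0.7595
done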
Citation
If you find Puffin useful for your research or applications, please cite our paper using the following BibTeX:
@article{liao2025puffin,
title={Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation},
author={Liao, Kang and Wu, Size and Wu, Zhonghua and Jin, Linyi and Wang, Chao and Wang, Yikai and Wang, Fei and Li, Wei and Loy, Chen Change},
journal={arXiv preprint arXiv:2510.08673},
year={2025}
}
License
This project is licensed under NTU S-Lab License 1.0.