HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
Abstract
HuMo is a unified framework for human-centric video generation that addresses challenges in multimodal control through a two-stage training paradigm and novel strategies for subject preservation and audio-visual synchronization.
Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, image, and audio. Existing methods struggle to coordinate these heterogeneous modalities effectively, owing to two challenges: the scarcity of training data with paired triplet conditions, and the difficulty of jointly handling the sub-tasks of subject preservation and audio-visual synchronization under multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct a high-quality dataset with diverse, paired text, reference images, and audio. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For subject preservation, we adopt a minimally invasive image injection strategy that retains the foundation model's prompt-following and visual generation abilities. For audio-visual sync, in addition to the commonly adopted audio cross-attention layers, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. For joint control across multimodal inputs, we progressively incorporate the audio-visual sync task on top of the previously acquired capabilities. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods on the individual sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project Page: https://phantom-video.github.io/HuMo.
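As a rough illustration of the time-adaptive Classifier-Free Guidance idea described above, the sketch below re-weights per-condition guidance terms across denoising steps. The function `time_adaptive_cfg`, the separate text/image/audio branches, the linear schedule, and all weight values are illustrative assumptions, not the authors' released implementation.

```python
import torch


def time_adaptive_cfg(step, num_steps,
                      eps_uncond, eps_text, eps_image, eps_audio,
                      w_text=(7.5, 3.0), w_image=(1.5, 4.0), w_audio=(2.0, 5.0)):
    """Hypothetical time-adaptive Classifier-Free Guidance combination.

    Each guidance weight is linearly interpolated from its early-step value
    to its late-step value, so, for example, audio guidance can be strengthened
    late in denoising while text guidance is relaxed. The (start, end) pairs
    and the linear schedule are assumptions for illustration only.
    """
    t = step / max(num_steps - 1, 1)  # 0.0 at the first denoising step, 1.0 at the last
    wt = w_text[0] + t * (w_text[1] - w_text[0])
    wi = w_image[0] + t * (w_image[1] - w_image[0])
    wa = w_audio[0] + t * (w_audio[1] - w_audio[0])

    # Combine condition-specific noise predictions as offsets from the
    # unconditional prediction, each scaled by its step-dependent weight.
    return (eps_uncond
            + wt * (eps_text - eps_uncond)
            + wi * (eps_image - eps_uncond)
            + wa * (eps_audio - eps_uncond))


if __name__ == "__main__":
    # Toy usage with random "noise predictions" of identical shape:
    # (batch, channels, frames, height, width).
    shape = (1, 4, 16, 32, 32)
    eps_u, eps_t, eps_i, eps_a = (torch.randn(shape) for _ in range(4))
    guided = time_adaptive_cfg(step=10, num_steps=50,
                               eps_uncond=eps_u, eps_text=eps_t,
                               eps_image=eps_i, eps_audio=eps_a)
    print(guided.shape)
```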
Community
Project Page: https://phantom-video.github.io/HuMo/
Related papers recommended by the Semantic Scholar API:
- HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation (2025)
- InfinityHuman: Towards Long-Term Audio-Driven Human (2025)
- SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation (2025)
- UniVerse-1: Unified Audio-Video Generation via Stitching of Experts (2025)
- DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework (2025)
- RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer (2025)
- Multi-human Interactive Talking Dataset (2025)
arXiv explained breakdown of this paper: https://arxivexplained.com/papers/humo-human-centric-video-generation-via-collaborative-multi-modal-conditioning
Models citing this paper: 3