---
title: FantasyTalking Demo
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
license: apache-2.0
short_description: Talking Portrait Generation Demo
---

# FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis

This is a Hugging Face Space demo for the FantasyTalking project, which generates realistic talking portraits from a single image and audio input.

## 🔥 Features

- **Single Image Input**: Generate talking videos from just one portrait image
- **Audio-driven Animation**: Synchronize lip movements with input audio
- **High Quality Output**: 512x512 resolution with up to 81 frames
- **Controllable Generation**: Adjust prompt and audio guidance scales

## 📋 Requirements

Due to the large model size (~40GB+) and GPU memory requirements, this demo shows the interface but requires local deployment for full functionality.

### System Requirements

- NVIDIA GPU with at least 5GB VRAM (low memory mode)
- 20GB+ VRAM recommended for optimal performance
- 50GB+ storage space for models

## 🚀 Local Deployment

To run FantasyTalking locally with full functionality:

```bash
# 1. Clone the repository
git clone https://github.com/Fantasy-AMAP/fantasy-talking.git
cd fantasy-talking

# 2. Install dependencies
pip install -r requirements.txt
pip install flash_attn  # Optional, for accelerated attention computation

# 3. Download models
# Base model (~20GB)
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./models/Wan2.1-I2V-14B-720P
# Audio encoder (~1GB)
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./models/wav2vec2-base-960h
# FantasyTalking weights (~2GB)
huggingface-cli download acvlab/FantasyTalking fantasytalking_model.ckpt --local-dir ./models

# 4. Run inference
python infer.py --image_path ./assets/images/woman.png --audio_path ./assets/audios/woman.wav

# 5. Start web interface
python app.py
```

## 🎯 Performance

Model performance on a single A100 (512x512, 81 frames):

| torch_dtype | num_persistent_param_in_dit | Speed | Required VRAM |
|------------|----------------------------|-------|---------------|
| torch.bfloat16 | None (unlimited) | 15.5s/it | 40G |
| torch.bfloat16 | 7×10⁹ (7B) | 32.8s/it | 20G |
| torch.bfloat16 | 0 | 42.6s/it | 5G |

## 📖 Citation

```bibtex
@article{wang2025fantasytalking,
  title={FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis},
  author={Wang, Mengchao and Wang, Qiang and Jiang, Fan and Fan, Yaqi and Zhang, Yunpeng and Qi, Yonggang and Zhao, Kun and Xu, Mu},
  journal={arXiv preprint arXiv:2504.04842},
  year={2025}
}
```

## 🔗 Links

- **Paper**: [arXiv:2504.04842](https://arxiv.org/abs/2504.04842)
- **Code**: [GitHub Repository](https://github.com/Fantasy-AMAP/fantasy-talking)
- **Models**: [Hugging Face](https://huggingface.co/acvlab/FantasyTalking)
- **Project Page**: [FantasyTalking](https://fantasy-amap.github.io/fantasy-talking/)

## 📄 License

This project is licensed under the Apache-2.0 License.
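
## 🧮 Example: Choosing a Memory Setting

The Performance table trades speed for VRAM through `num_persistent_param_in_dit`. The sketch below shows one way to pick the fastest row that fits a given GPU. The names `PROFILES` and `pick_profile` are hypothetical helpers, not part of the FantasyTalking code, and treating the "Required VRAM" column as a hard threshold is an assumption. How the chosen value is passed to `infer.py` or `app.py` is not covered here.

```python
# Rows copied from the Performance table (single A100, 512x512, 81 frames).
# Each entry: (required VRAM in GB, num_persistent_param_in_dit, speed in s/it).
# Ordered from fastest to slowest so the first match is the fastest fit.
PROFILES = [
    (40, None, 15.5),       # None: no limit on persistent parameters
    (20, 7 * 10**9, 32.8),  # keep about 7B parameters resident on the GPU
    (5, 0, 42.6),           # low memory mode
]


def pick_profile(available_vram_gb):
    """Return (num_persistent_param_in_dit, speed_s_per_it) for the fastest row that fits.

    Hypothetical helper: it only encodes the table and does not query the GPU.
    """
    for required_gb, num_persistent, speed in PROFILES:
        if available_vram_gb >= required_gb:
            return num_persistent, speed
    raise ValueError("Less than 5 GB of VRAM: no row in the table fits.")


if __name__ == "__main__":
    for vram_gb in (48, 24, 8):
        num_persistent, speed = pick_profile(vram_gb)
        print(f"{vram_gb} GB -> num_persistent_param_in_dit={num_persistent}, ~{speed} s/it")
```

With PyTorch installed, the available memory could be read from `torch.cuda.get_device_properties(0).total_memory` (in bytes) and converted to GB before calling `pick_profile`. Leave some headroom, since the table reports measured requirements on an A100 rather than guaranteed limits.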