---
title: FantasyTalking Demo
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
license: apache-2.0
short_description: Talking Portrait Generation Demo
---

# FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis

This is a Hugging Face Space demo for the FantasyTalking project, which generates realistic talking portraits from a single image and audio input.

## 🔥 Features

- **Single Image Input**: Generate talking videos from just one portrait image
- **Audio-driven Animation**: Synchronize lip movements with input audio
- **High Quality Output**: 512x512 resolution with up to 81 frames
- **Controllable Generation**: Adjust prompt and audio guidance scales

## 📋 Requirements

Because the models are large (~40GB+) and need substantial GPU memory, this Space only shows the interface. Full functionality requires local deployment.

### System Requirements
- NVIDIA GPU with at least 5GB VRAM (low memory mode)
- 20GB+ VRAM recommended for optimal performance
- 50GB+ storage space for models
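
Before downloading anything, it can help to confirm that the target disk has room for the models. The following is a minimal sketch using only the Python standard library. The helper name `has_enough_space` and the choice to count 1GB as 10^9 bytes are assumptions for illustration, not part of the FantasyTalking code.

```python
import shutil

GB = 10**9  # assumption: decimal gigabytes


def has_enough_space(path: str = ".", required_gb: float = 50) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * GB


if __name__ == "__main__":
    # 50 GB matches the storage figure listed in the system requirements.
    print("Enough space for models:", has_enough_space(".", 50))
```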

## 🚀 Local Deployment

To run FantasyTalking locally with full functionality:

```bash
# 1. Clone the repository
git clone https://github.com/Fantasy-AMAP/fantasy-talking.git
cd fantasy-talking

# 2. Install dependencies
pip install -r requirements.txt
pip install flash_attn  # Optional, for accelerated attention computation

# 3. Download models
# Base model (~20GB)
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./models/Wan2.1-I2V-14B-720P

# Audio encoder (~1GB)
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./models/wav2vec2-base-960h

# FantasyTalking weights (~2GB)
huggingface-cli download acvlab/FantasyTalking fantasytalking_model.ckpt --local-dir ./models

# 4. Run inference
python infer.py --image_path ./assets/images/woman.png --audio_path ./assets/audios/woman.wav

# 5. Start web interface
python app.py
```
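
The download step can also be scripted from Python with the `huggingface_hub` library, which provides `snapshot_download` (whole repository) and `hf_hub_download` (single file). The sketch below mirrors the three shell commands above. The `DOWNLOADS` list and the `download_all` helper are hypothetical names introduced here, not part of the repository.

```python
# Repositories and target directories copied from the shell commands above.
# Each entry: (repo_id, single filename or None for the whole repo, local_dir)
DOWNLOADS = [
    ("Wan-AI/Wan2.1-I2V-14B-720P", None, "./models/Wan2.1-I2V-14B-720P"),
    ("facebook/wav2vec2-base-960h", None, "./models/wav2vec2-base-960h"),
    ("acvlab/FantasyTalking", "fantasytalking_model.ckpt", "./models"),
]


def download_all(downloads=DOWNLOADS):
    """Fetch every entry. Needs `pip install huggingface_hub` and network access."""
    # Imported lazily so the module loads even without the library installed.
    from huggingface_hub import hf_hub_download, snapshot_download

    for repo_id, filename, local_dir in downloads:
        if filename is None:
            # Download the full repository snapshot into local_dir.
            snapshot_download(repo_id=repo_id, local_dir=local_dir)
        else:
            # Download one named file into local_dir.
            hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)
```

Calling `download_all()` should produce the same `./models` layout as the CLI commands, assuming the repository names and file name above are still current.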

## 🎯 Performance

Model performance on a single A100 (512x512, 81 frames):

| torch_dtype | num_persistent_param_in_dit | Speed | Required VRAM |
|------------|----------------------------|-------|---------------|
| torch.bfloat16 | None (unlimited) | 15.5s/it | 40G |
| torch.bfloat16 | 7×10⁹ (7B) | 32.8s/it | 20G |
| torch.bfloat16 | 0 | 42.6s/it | 5G |
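
The table amounts to a trade-off between speed and memory, which can be sketched as a small lookup. The function name `choose_persistent_params` is hypothetical, and treating the table's VRAM figures as hard thresholds is an assumption: the numbers were measured on an A100 at 512x512 with 81 frames and may differ on other hardware or settings.

```python
def choose_persistent_params(vram_gb: float):
    """Map available VRAM (GB) to a num_persistent_param_in_dit value from the table."""
    if vram_gb >= 40:
        return None        # no limit: fastest row, about 40G of VRAM
    if vram_gb >= 20:
        return 7 * 10**9   # middle row: 7x10^9 persistent parameters
    if vram_gb >= 5:
        return 0           # low memory mode: slowest row, about 5G of VRAM
    raise ValueError("Less than 5 GB of VRAM: below the smallest row in the table")


for vram in (80, 24, 8):
    print(vram, "GB ->", choose_persistent_params(vram))
```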

## 📖 Citation

```bibtex
@article{wang2025fantasytalking,
   title={FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis},
   author={Wang, Mengchao and Wang, Qiang and Jiang, Fan and Fan, Yaqi and Zhang, Yunpeng and Qi, Yonggang and Zhao, Kun and Xu, Mu},
   journal={arXiv preprint arXiv:2504.04842},
   year={2025}
}
```

## 🔗 Links

- **Paper**: [arXiv:2504.04842](https://arxiv.org/abs/2504.04842)
- **Code**: [GitHub Repository](https://github.com/Fantasy-AMAP/fantasy-talking)
- **Models**: [Hugging Face](https://huggingface.co/acvlab/FantasyTalking)
- **Project Page**: [FantasyTalking](https://fantasy-amap.github.io/fantasy-talking/)

## 📄 License

This project is licensed under the Apache-2.0 License.