# OmniAvatar-14B Integration - Avatar Video Generation with Adaptive Body Animation
This project integrates the powerful [OmniAvatar-14B model](https://huggingface.co/OmniAvatar/OmniAvatar-14B) to provide audio-driven avatar video generation with adaptive body animation.
## Features
### Core Capabilities
- **Audio-Driven Animation**: Generate realistic avatar videos synchronized with speech
- **Adaptive Body Animation**: Dynamic body movements that adapt to speech content
- **Multi-Modal Input Support**: Text prompts, audio files, and reference images
- **Advanced TTS Integration**: Multiple text-to-speech systems with fallback
- **Web Interface**: Both Gradio UI and FastAPI endpoints
- **Performance Optimization**: TeaCache acceleration and multi-GPU support
### Technical Features
- **480p Video Generation** with 25 fps output
- **Lip-Sync Accuracy** with audio-visual alignment
- **Reference Image Support** for character consistency
- **Prompt-Controlled Behavior** for specific actions and expressions
- **Memory Efficient** with FSDP and gradient checkpointing
- **Scalable** from single GPU to multi-GPU setups
## Quick Start
### 1. Setup Environment
```powershell
# Clone and navigate to the project
cd AI_Avatar_Chat
# Install dependencies
pip install -r requirements.txt
```
### 2. Download OmniAvatar Models
**Option A: Using PowerShell Script (Windows)**
```powershell
# Run the automated setup script
.\setup_omniavatar.ps1
```
**Option B: Using Python Script (Cross-platform)**
```bash
# Run the Python setup script
python setup_omniavatar.py
```
**Option C: Manual Download**
```bash
# Install HuggingFace CLI
pip install "huggingface_hub[cli]"
# Create directories
mkdir -p pretrained_models
# Download models (about 30 GB total)
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./pretrained_models/Wan2.1-T2V-14B
huggingface-cli download OmniAvatar/OmniAvatar-14B --local-dir ./pretrained_models/OmniAvatar-14B
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./pretrained_models/wav2vec2-base-960h
```
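If you prefer to script the download, `huggingface_hub` can fetch the same repos directly; a minimal sketch:
```python
from huggingface_hub import snapshot_download

# Fetch each model into pretrained_models/, matching the CLI layout above
for repo_id in (
    "Wan-AI/Wan2.1-T2V-14B",
    "OmniAvatar/OmniAvatar-14B",
    "facebook/wav2vec2-base-960h",
):
    snapshot_download(
        repo_id=repo_id,
        local_dir=f"./pretrained_models/{repo_id.split('/')[-1]}",
    )
```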
### 3. Run the Application
```bash
# Start the application
python app.py
# Access the web interface
# Gradio UI: http://localhost:7860/gradio
# API docs: http://localhost:7860/docs
```
## Usage Guide
### Gradio Web Interface
1. **Enter Character Description**: Describe the avatar's appearance and behavior
2. **Provide Audio Input**: Choose from:
   - **Text-to-Speech**: Enter text to be spoken (recommended for beginners)
   - **Audio URL**: Direct link to an audio file
3. **Optional Reference Image**: URL to a reference photo for character consistency
4. **Adjust Parameters**:
   - **Guidance Scale**: 4-6 recommended (controls prompt adherence)
   - **Audio Scale**: 3-5 recommended (controls lip-sync accuracy)
   - **Steps**: 20-50 recommended (quality vs. speed trade-off)
5. **Generate**: Click to create your avatar video!
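The recommended ranges in step 4 map naturally onto reusable presets; the values below are illustrative picks within those ranges, not project defaults:
```python
# Illustrative presets within the recommended parameter ranges above
PRESETS = {
    # Fast, lower-quality runs for iterating on prompts
    "preview": {"guidance_scale": 4.0, "audio_scale": 3.0, "num_steps": 20},
    # Slow, higher-quality runs for final output
    "final": {"guidance_scale": 5.5, "audio_scale": 4.5, "num_steps": 50},
}
```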
### API Usage
```python
import requests

# Generate an avatar video via the REST endpoint
response = requests.post("http://localhost:7860/generate", json={
    "prompt": "A professional teacher explaining concepts with clear gestures",
    "text_to_speech": "Hello students, today we'll learn about artificial intelligence.",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "guidance_scale": 5.0,
    "audio_scale": 3.5,
    "num_steps": 30,
})
result = response.json()
print(f"Output path: {result['output_path']}")
```
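Depending on the deployment, `output_path` may be a local file path or a fetchable URL. If the server returns a URL, the clip can be saved like this (a sketch under that assumption):
```python
import requests

video_url = result["output_path"]  # assumption: the server returns an HTTP URL
with requests.get(video_url, stream=True, timeout=120) as r:
    r.raise_for_status()
    with open("avatar.mp4", "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```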
### Input Formats
**Prompt Structure** (based on OmniAvatar paper recommendations):
```
[Character Description] - [Behavior Description] - [Background Description (optional)]
```
**Examples:**
- `"A friendly teacher explaining concepts - enthusiastic hand gestures - modern classroom"`
- `"Professional news anchor - confident delivery - news studio background"`
- `"Casual presenter - relaxed speaking style - home office setting"`
## Configuration
### Performance Optimization
Based on your hardware, the system will automatically optimize settings:
**High-end GPU (32GB+ VRAM)**:
- Full quality: 60000 tokens, unlimited parameters
- Speed: ~16s per iteration
**Medium GPU (16-32GB VRAM)**:
- Balanced: 30000 tokens, 7B parameter limit
- Speed: ~19s per iteration
**Low-end GPU (8-16GB VRAM)**:
- Memory efficient: 15000 tokens, minimal parameters
- Speed: ~22s per iteration
**Multi-GPU Setup (4+ GPUs)**:
- Optimal performance: Sequence parallel processing
- Speed: ~4.8s per iteration
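As a sketch of how such tiering might be reproduced in user code (assuming PyTorch is installed; the thresholds mirror the tiers above, and the project's own selection logic may differ):
```python
import torch

def pick_max_tokens():
    """Pick a max_tokens tier from detected VRAM, mirroring the tiers above."""
    if not torch.cuda.is_available():
        return 15000  # no GPU: use the most conservative tier
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 32:
        return 60000  # high-end tier
    if vram_gb >= 16:
        return 30000  # medium tier
    return 15000      # low-end tier
```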
### Advanced Settings
Edit `configs/inference.yaml` for fine-tuning:
```yaml
inference:
  max_tokens: 30000          # Context length
  guidance_scale: 4.5        # Prompt adherence
  audio_scale: 3.0           # Lip-sync strength
  num_steps: 25              # Quality iterations
  overlap_frame: 13          # Temporal consistency
  tea_cache_l1_thresh: 0.14  # Memory optimization

generation:
  resolution: "480p"         # Output resolution
  frame_rate: 25             # Video frame rate
  duration_seconds: 10       # Max video length
```
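Reading these values back in Python is straightforward with PyYAML:
```python
import yaml

with open("configs/inference.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["inference"]["guidance_scale"])  # 4.5
print(cfg["generation"]["resolution"])     # 480p
```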
## Best Practices
### Prompt Engineering
1. **Be Descriptive**: Include character appearance, behavior, and setting
2. **Use Action Words**: "explaining", "presenting", "demonstrating"
3. **Specify Context**: Professional, casual, educational, etc.
### Audio Guidelines
1. **Clear Speech**: Use high-quality audio with minimal background noise
2. **Appropriate Length**: 5-30 seconds for best results
3. **Natural Pace**: Avoid too fast or too slow speech
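The wav2vec2-base-960h encoder was trained on 16 kHz audio, so resampling inputs up front avoids surprises; a sketch using librosa and soundfile (assumed utilities, not declared project dependencies):
```python
import librosa
import soundfile as sf

# Resample to 16 kHz mono, the rate wav2vec2-base-960h was trained on
audio, sr = librosa.load("input.mp3", sr=16000, mono=True)
sf.write("input_16k.wav", audio, sr)
```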
### Performance Tips
1. **Start Small**: Use fewer steps (20-25) for testing
2. **Monitor VRAM**: Check GPU memory usage during generation
3. **Batch Processing**: Process multiple samples efficiently
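For tip 2, VRAM can also be checked from Python, assuming PyTorch is installed:
```python
import torch

if torch.cuda.is_available():
    used = torch.cuda.memory_allocated() / 1e9
    total = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"VRAM: {used:.1f} / {total:.1f} GB")
```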
## Model Information
### Architecture Overview
- **Base Model**: Wan2.1-T2V-14B (28GB) - Text-to-video generation
- **Avatar Weights**: OmniAvatar-14B (2GB) - LoRA adaptation for avatar animation
- **Audio Encoder**: wav2vec2-base-960h (360MB) - Speech feature extraction
### Capabilities
- **Resolution**: 480p (higher resolutions planned)
- **Duration**: Up to 30 seconds per generation
- **Audio Formats**: WAV, MP3, M4A, OGG
- **Image Formats**: JPG, PNG, WebP
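A quick client-side check against these formats (an illustrative helper, not part of the API):
```python
from pathlib import Path

AUDIO_EXTS = {".wav", ".mp3", ".m4a", ".ogg"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def validate_inputs(audio_path, image_path=None):
    """Reject files whose extensions fall outside the supported formats."""
    if Path(audio_path).suffix.lower() not in AUDIO_EXTS:
        raise ValueError(f"Unsupported audio format: {audio_path}")
    if image_path and Path(image_path).suffix.lower() not in IMAGE_EXTS:
        raise ValueError(f"Unsupported image format: {image_path}")
```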
## Troubleshooting
### Common Issues
**"Models not found" Error**:
- Solution: Run the setup script to download required models
- Check: Ensure `pretrained_models/` directory contains all three model folders
**CUDA Out of Memory**:
- Solution: Reduce `max_tokens` or `num_steps` in configuration
- Alternative: Enable FSDP mode for memory efficiency
**Slow Generation**:
- Check: GPU utilization and VRAM usage
- Optimize: Use TeaCache with appropriate threshold (0.05-0.15)
- Consider: Multi-GPU setup for faster processing
**Audio Sync Issues**:
- Increase: `audio_scale` parameter (3.0-5.0)
- Check: Audio quality and clarity
- Ensure: Proper audio file format
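When OOM errors are intermittent, one pragmatic pattern is to retry with progressively cheaper settings (a sketch around the engine call shown under Integration Examples; the fallback step counts are illustrative):
```python
import torch
from omniavatar_engine import omni_engine

def generate_with_fallback(prompt, audio_path, step_ladder=(30, 25, 20)):
    """Retry generation with fewer steps after a CUDA out-of-memory error."""
    last_err = None
    for num_steps in step_ladder:
        try:
            return omni_engine.generate_video(
                prompt=prompt, audio_path=audio_path, num_steps=num_steps
            )
        except RuntimeError as e:  # PyTorch raises RuntimeError on CUDA OOM
            if "out of memory" not in str(e).lower():
                raise
            torch.cuda.empty_cache()  # release cached blocks before retrying
            last_err = e
    raise last_err
```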
### Performance Monitoring
```bash
# Check GPU usage
nvidia-smi
# Monitor generation progress
tail -f logs/generation.log
# Test system capabilities
python -c "from omniavatar_engine import omni_engine; print(omni_engine.get_model_info())"
```
## Integration Examples
### Custom TTS Integration
```python
from omniavatar_engine import omni_engine

# Generate with custom audio
video_path, time_taken = omni_engine.generate_video(
    prompt="A friendly teacher explaining AI concepts",
    audio_path="path/to/your/audio.wav",
    image_path="path/to/reference/image.jpg",  # Optional
    guidance_scale=5.0,
    audio_scale=3.5,
    num_steps=30,
)
print(f"Generated video: {video_path} in {time_taken:.1f}s")
```
### Batch Processing
```python
import asyncio

from omniavatar_engine import omni_engine

async def batch_generate(prompts_and_audio):
    """Generate one video per (prompt, audio_path) pair."""
    results = []
    for prompt, audio_path in prompts_and_audio:
        try:
            # Run the blocking generation call off the event loop
            video_path, time_taken = await asyncio.to_thread(
                omni_engine.generate_video,
                prompt=prompt,
                audio_path=audio_path,
            )
            results.append((video_path, time_taken))
        except Exception as e:
            print(f"Failed to generate for {prompt!r}: {e}")
    return results
```
## References
- **OmniAvatar Paper**: [arXiv:2506.18866](https://arxiv.org/abs/2506.18866)
- **Official Repository**: [GitHub - Omni-Avatar/OmniAvatar](https://github.com/Omni-Avatar/OmniAvatar)
- **HuggingFace Model**: [OmniAvatar/OmniAvatar-14B](https://huggingface.co/OmniAvatar/OmniAvatar-14B)
- **Base Model**: [Wan-AI/Wan2.1-T2V-14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)
## Contributing
We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
## License
This project is licensed under Apache 2.0. See [LICENSE](LICENSE) for details.
## Support
For questions and support:
- Email: ganqijun@zju.edu.cn (OmniAvatar authors)
- Issues: [GitHub Issues](https://github.com/Omni-Avatar/OmniAvatar/issues)
- Documentation: [Official Docs](https://github.com/Omni-Avatar/OmniAvatar)
---
**Citation**:
```bibtex
@misc{gan2025omniavatar,
title={OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation},
author={Qijun Gan and Ruizi Yang and Jianke Zhu and Shaofei Xue and Steven Hoi},
year={2025},
eprint={2506.18866},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```