# OmniAvatar-14B Integration - Avatar Video Generation with Adaptive Body Animation

This project integrates the powerful [OmniAvatar-14B model](https://huggingface.co/OmniAvatar/OmniAvatar-14B) to provide audio-driven avatar video generation with adaptive body animation.

## 🌟 Features

### Core Capabilities
- **Audio-Driven Animation**: Generate realistic avatar videos synchronized with speech
- **Adaptive Body Animation**: Dynamic body movements that adapt to speech content
- **Multi-Modal Input Support**: Text prompts, audio files, and reference images
- **Advanced TTS Integration**: Multiple text-to-speech systems with fallback
- **Web Interface**: Both Gradio UI and FastAPI endpoints
- **Performance Optimization**: TeaCache acceleration and multi-GPU support

### Technical Features
- ✅ **480p Video Generation** with 25fps output
- ✅ **Lip-Sync Accuracy** with audio-visual alignment
- ✅ **Reference Image Support** for character consistency
- ✅ **Prompt-Controlled Behavior** for specific actions and expressions
- ✅ **Memory Efficient** with FSDP and gradient checkpointing
- ✅ **Scalable** from single GPU to multi-GPU setups

## 🚀 Quick Start

### 1. Setup Environment

```powershell
# Clone and navigate to the project
cd AI_Avatar_Chat

# Install dependencies
pip install -r requirements.txt
```

### 2. Download OmniAvatar Models

**Option A: Using PowerShell Script (Windows)**
```powershell
# Run the automated setup script
.\setup_omniavatar.ps1
```

**Option B: Using Python Script (Cross-platform)**
```bash
# Run the Python setup script
python setup_omniavatar.py
```

**Option C: Manual Download**
```bash
# Install HuggingFace CLI
pip install "huggingface_hub[cli]"

# Create directories
mkdir -p pretrained_models

# Download models (this will take ~30GB)
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./pretrained_models/Wan2.1-T2V-14B
huggingface-cli download OmniAvatar/OmniAvatar-14B --local-dir ./pretrained_models/OmniAvatar-14B
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./pretrained_models/wav2vec2-base-960h
```

### 3. Run the Application

```bash
# Start the application
python app.py

# Access the web interface
# Gradio UI: http://localhost:7860/gradio
# API docs: http://localhost:7860/docs
```

## 📖 Usage Guide

### Gradio Web Interface

1. **Enter Character Description**: Describe the avatar's appearance and behavior
2. **Provide Audio Input**: Choose from:
   - **Text-to-Speech**: Enter text to be spoken (recommended for beginners)
   - **Audio URL**: Direct link to an audio file
3. **Optional Reference Image**: URL to a reference photo for character consistency
4. **Adjust Parameters**:
   - **Guidance Scale**: 4-6 recommended (controls prompt adherence)
   - **Audio Scale**: 3-5 recommended (controls lip-sync accuracy)
   - **Steps**: 20-50 recommended (quality vs speed trade-off)
5. **Generate**: Click to create your avatar video!

### API Usage

```python
import requests

# Generate avatar video
response = requests.post("http://localhost:7860/generate", json={
    "prompt": "A professional teacher explaining concepts with clear gestures",
    "text_to_speech": "Hello students, today we'll learn about artificial intelligence.",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "guidance_scale": 5.0,
    "audio_scale": 3.5,
    "num_steps": 30
})

result = response.json()
print(f"Video URL: {result['output_path']}")
```

### Input Formats

**Prompt Structure** (based on OmniAvatar paper recommendations):
```
[Character Description] - [Behavior Description] - [Background Description (optional)]
```

**Examples:**
- `"A friendly teacher explaining concepts - enthusiastic hand gestures - modern classroom"`
- `"Professional news anchor - confident delivery - news studio background"`
- `"Casual presenter - relaxed speaking style - home office setting"`

## βš™οΈ Configuration

### Performance Optimization

Based on your hardware, the system will automatically optimize settings:

**High-end GPU (32GB+ VRAM)**:
- Full quality: 60000 tokens, unlimited parameters
- Speed: ~16s per iteration

**Medium GPU (16-32GB VRAM)**:
- Balanced: 30000 tokens, 7B parameter limit
- Speed: ~19s per iteration

**Low-end GPU (8-16GB VRAM)**:
- Memory efficient: 15000 tokens, minimal parameters
- Speed: ~22s per iteration

**Multi-GPU Setup (4+ GPUs)**:
- Optimal performance: Sequence parallel processing
- Speed: ~4.8s per iteration
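
The single-GPU tier selection above can be sketched in a few lines. The function below is a hypothetical illustration only; the actual logic lives inside the engine and may differ:

```python
import torch

def pick_profile() -> dict:
    """Map available VRAM to the settings tiers described above (illustrative)."""
    if not torch.cuda.is_available():
        raise RuntimeError("A CUDA-capable GPU is required for generation")
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if vram_gb >= 32:
        return {"max_tokens": 60000, "profile": "full quality"}
    if vram_gb >= 16:
        return {"max_tokens": 30000, "profile": "balanced"}
    return {"max_tokens": 15000, "profile": "memory efficient"}
```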

### Advanced Settings

Edit `configs/inference.yaml` for fine-tuning:

```yaml
inference:
  max_tokens: 30000          # Context length
  guidance_scale: 4.5        # Prompt adherence
  audio_scale: 3.0           # Lip-sync strength
  num_steps: 25              # Quality iterations
  overlap_frame: 13          # Temporal consistency
  tea_cache_l1_thresh: 0.14  # Memory optimization

generation:
  resolution: "480p"         # Output resolution
  frame_rate: 25             # Video frame rate
  duration_seconds: 10       # Max video length
```
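
The same values can be read and adjusted from Python before a run. A minimal sketch using PyYAML (`pip install pyyaml`), assuming the file lives at `configs/inference.yaml` as above:

```python
import yaml

with open("configs/inference.yaml") as f:
    config = yaml.safe_load(f)

# Lower the step count for a quick test run
config["inference"]["num_steps"] = 20

with open("configs/inference.yaml", "w") as f:
    yaml.safe_dump(config, f)
```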

## 🎯 Best Practices

### Prompt Engineering
1. **Be Descriptive**: Include character appearance, behavior, and setting
2. **Use Action Words**: "explaining", "presenting", "demonstrating"
3. **Specify Context**: Professional, casual, educational, etc.

### Audio Guidelines
1. **Clear Speech**: Use high-quality audio with minimal background noise
2. **Appropriate Length**: 5-30 seconds for best results
3. **Natural Pace**: Avoid speech that is too fast or too slow
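
To check the recommended 5-30 second range programmatically, here is a standard-library sketch for WAV input (other formats would need a decoder such as pydub):

```python
import wave

def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

duration = wav_duration_seconds("input.wav")  # placeholder file name
if not 5 <= duration <= 30:
    print(f"Warning: {duration:.1f}s is outside the recommended 5-30s range")
```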

### Performance Tips
1. **Start Small**: Use fewer steps (20-25) for testing
2. **Monitor VRAM**: Check GPU memory usage during generation
3. **Batch Processing**: Queue multiple samples in one session (see the Batch Processing example below)

## 📊 Model Information

### Architecture Overview
- **Base Model**: Wan2.1-T2V-14B (28GB) - Text-to-video generation
- **Avatar Weights**: OmniAvatar-14B (2GB) - LoRA adaptation for avatar animation
- **Audio Encoder**: wav2vec2-base-960h (360MB) - Speech feature extraction

### Capabilities
- **Resolution**: 480p (higher resolutions planned)
- **Duration**: Up to 30 seconds per generation
- **Audio Formats**: WAV, MP3, M4A, OGG
- **Image Formats**: JPG, PNG, WebP

## 🔧 Troubleshooting

### Common Issues

**"Models not found" Error**:
- Solution: Run the setup script to download required models
- Check: Ensure `pretrained_models/` directory contains all three model folders
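
A quick sanity check for the three folders, using the paths from the manual download step:

```python
from pathlib import Path

REQUIRED = [
    "pretrained_models/Wan2.1-T2V-14B",
    "pretrained_models/OmniAvatar-14B",
    "pretrained_models/wav2vec2-base-960h",
]

missing = [p for p in REQUIRED if not Path(p).is_dir()]
print("Missing model folders:", missing if missing else "none")
```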

**CUDA Out of Memory**:
- Solution: Reduce `max_tokens` or `num_steps` in configuration
- Alternative: Enable FSDP mode for memory efficiency

**Slow Generation**:
- Check: GPU utilization and VRAM usage
- Optimize: Use TeaCache with appropriate threshold (0.05-0.15)
- Consider: Multi-GPU setup for faster processing

**Audio Sync Issues**:
- Increase: `audio_scale` parameter (3.0-5.0)
- Check: Audio quality and clarity
- Ensure: Proper audio file format

### Performance Monitoring

```bash
# Check GPU usage
nvidia-smi

# Monitor generation progress
tail -f logs/generation.log

# Test system capabilities
python -c "from omniavatar_engine import omni_engine; print(omni_engine.get_model_info())"
```

## 🔗 Integration Examples

### Custom TTS Integration

```python
from omniavatar_engine import omni_engine

# Generate with custom audio
video_path, time_taken = omni_engine.generate_video(
    prompt="A friendly teacher explaining AI concepts",
    audio_path="path/to/your/audio.wav",
    image_path="path/to/reference/image.jpg",  # Optional
    guidance_scale=5.0,
    audio_scale=3.5,
    num_steps=30
)

print(f"Generated video: {video_path} in {time_taken:.1f}s")
```

### Batch Processing

```python
from omniavatar_engine import omni_engine

def batch_generate(prompts_and_audio):
    """Generate one video per (prompt, audio_path) pair, skipping failures."""
    results = []
    for prompt, audio_path in prompts_and_audio:
        try:
            video_path, time_taken = omni_engine.generate_video(
                prompt=prompt,
                audio_path=audio_path
            )
            results.append((video_path, time_taken))
        except Exception as e:
            print(f"Failed to generate for {prompt!r}: {e}")
    return results
```
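
Example invocation (the prompts and file names below are placeholders):

```python
jobs = [
    ("A news anchor - confident delivery - studio background", "audio/clip1.wav"),
    ("A casual presenter - relaxed speaking style - home office", "audio/clip2.wav"),
]

for video_path, seconds in batch_generate(jobs):
    print(f"{video_path} ({seconds:.1f}s)")
```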

## 📚 References

- **OmniAvatar Paper**: [arXiv:2506.18866](https://arxiv.org/abs/2506.18866)
- **Official Repository**: [GitHub - Omni-Avatar/OmniAvatar](https://github.com/Omni-Avatar/OmniAvatar)
- **HuggingFace Model**: [OmniAvatar/OmniAvatar-14B](https://huggingface.co/OmniAvatar/OmniAvatar-14B)
- **Base Model**: [Wan-AI/Wan2.1-T2V-14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)

## 🤝 Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## 📄 License

This project is licensed under Apache 2.0. See [LICENSE](LICENSE) for details.

## 🙋 Support

For questions and support:
- 📧 Email: ganqijun@zju.edu.cn (OmniAvatar authors)
- 💬 Issues: [GitHub Issues](https://github.com/Omni-Avatar/OmniAvatar/issues)
- 📖 Documentation: [Official Docs](https://github.com/Omni-Avatar/OmniAvatar)

---

**Citation**:
```bibtex
@misc{gan2025omniavatar,
  title={OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation},
  author={Qijun Gan and Ruizi Yang and Jianke Zhu and Shaofei Xue and Steven Hoi},
  year={2025},
  eprint={2506.18866},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```