# OmniAvatar-14B Integration - Avatar Video Generation with Adaptive Body Animation

This project integrates the powerful [OmniAvatar-14B model](https://huggingface.co/OmniAvatar/OmniAvatar-14B) to provide audio-driven avatar video generation with adaptive body animation.

## 🌟 Features

### Core Capabilities
- **Audio-Driven Animation**: Generate realistic avatar videos synchronized with speech
- **Adaptive Body Animation**: Dynamic body movements that adapt to speech content
- **Multi-Modal Input Support**: Text prompts, audio files, and reference images
- **Advanced TTS Integration**: Multiple text-to-speech systems with fallback
- **Web Interface**: Both Gradio UI and FastAPI endpoints
- **Performance Optimization**: TeaCache acceleration and multi-GPU support

### Technical Features
- ✅ **480p Video Generation** with 25fps output
- ✅ **Lip-Sync Accuracy** with audio-visual alignment
- ✅ **Reference Image Support** for character consistency
- ✅ **Prompt-Controlled Behavior** for specific actions and expressions
- ✅ **Memory Efficient** with FSDP and gradient checkpointing
- ✅ **Scalable** from single GPU to multi-GPU setups

## 🚀 Quick Start

### 1. Setup Environment

```powershell
# Clone and navigate to the project
cd AI_Avatar_Chat

# Install dependencies
pip install -r requirements.txt
```

### 2. Download OmniAvatar Models

**Option A: Using PowerShell Script (Windows)**
```powershell
# Run the automated setup script
.\setup_omniavatar.ps1
```

**Option B: Using Python Script (Cross-platform)**
```bash
# Run the Python setup script
python setup_omniavatar.py
```

**Option C: Manual Download**
```bash
# Install HuggingFace CLI
pip install "huggingface_hub[cli]"

# Create directories
mkdir -p pretrained_models

# Download models (this will take ~30GB)
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./pretrained_models/Wan2.1-T2V-14B
huggingface-cli download OmniAvatar/OmniAvatar-14B --local-dir ./pretrained_models/OmniAvatar-14B
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./pretrained_models/wav2vec2-base-960h
```

### 3. Run the Application

```bash
# Start the application
python app.py

# Access the web interface
# Gradio UI: http://localhost:7860/gradio
# API docs: http://localhost:7860/docs
```

## 📖 Usage Guide

### Gradio Web Interface

1. **Enter Character Description**: Describe the avatar's appearance and behavior
2. **Provide Audio Input**: Choose from:
   - **Text-to-Speech**: Enter text to be spoken (recommended for beginners)
   - **Audio URL**: Direct link to an audio file
3. **Optional Reference Image**: URL to a reference photo for character consistency
4. **Adjust Parameters**:
   - **Guidance Scale**: 4-6 recommended (controls prompt adherence)
   - **Audio Scale**: 3-5 recommended (controls lip-sync accuracy)
   - **Steps**: 20-50 recommended (quality vs speed trade-off)
5. **Generate**: Click to create your avatar video!

### API Usage

```python
import requests

# Generate avatar video
response = requests.post("http://localhost:7860/generate", json={
    "prompt": "A professional teacher explaining concepts with clear gestures",
    "text_to_speech": "Hello students, today we'll learn about artificial intelligence.",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "guidance_scale": 5.0,
    "audio_scale": 3.5,
    "num_steps": 30
})

result = response.json()
print(f"Video URL: {result['output_path']}")
```

### Input Formats

**Prompt Structure** (based on OmniAvatar paper recommendations):
```
[Character Description] - [Behavior Description] - [Background Description (optional)]
```

**Examples:**
- `"A friendly teacher explaining concepts - enthusiastic hand gestures - modern classroom"`
- `"Professional news anchor - confident delivery - news studio background"`
- `"Casual presenter - relaxed speaking style - home office setting"`

## βš™οΈ Configuration

### Performance Optimization

Based on your hardware, the system will automatically optimize settings:

**High-end GPU (32GB+ VRAM)**:
- Full quality: 60000 tokens, unlimited parameters
- Speed: ~16s per iteration

**Medium GPU (16-32GB VRAM)**:
- Balanced: 30000 tokens, 7B parameter limit
- Speed: ~19s per iteration

**Low-end GPU (8-16GB VRAM)**:
- Memory efficient: 15000 tokens, minimal parameters
- Speed: ~22s per iteration

**Multi-GPU Setup (4+ GPUs)**:
- Optimal performance: Sequence parallel processing
- Speed: ~4.8s per iteration
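
The single-GPU tier selection above can be sketched in a few lines. The function below is a hypothetical illustration only; the actual logic lives inside the engine and may differ:

```python
import torch

def pick_profile() -> dict:
    """Map available VRAM to the settings tiers described above (illustrative)."""
    if not torch.cuda.is_available():
        raise RuntimeError("A CUDA-capable GPU is required for generation")
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if vram_gb >= 32:
        return {"max_tokens": 60000, "profile": "full quality"}
    if vram_gb >= 16:
        return {"max_tokens": 30000, "profile": "balanced"}
    return {"max_tokens": 15000, "profile": "memory efficient"}
```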

### Advanced Settings

Edit `configs/inference.yaml` for fine-tuning:

```yaml
inference:
  max_tokens: 30000          # Context length
  guidance_scale: 4.5        # Prompt adherence
  audio_scale: 3.0           # Lip-sync strength
  num_steps: 25              # Quality iterations
  overlap_frame: 13          # Temporal consistency
  tea_cache_l1_thresh: 0.14  # Memory optimization

generation:
  resolution: "480p"         # Output resolution
  frame_rate: 25             # Video frame rate
  duration_seconds: 10       # Max video length
```
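
The same values can be read and adjusted from Python before a run. A minimal sketch using PyYAML (`pip install pyyaml`), assuming the file lives at `configs/inference.yaml` as above:

```python
import yaml

with open("configs/inference.yaml") as f:
    config = yaml.safe_load(f)

# Lower the step count for a quick test run
config["inference"]["num_steps"] = 20

with open("configs/inference.yaml", "w") as f:
    yaml.safe_dump(config, f)
```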

## 🎯 Best Practices

### Prompt Engineering
1. **Be Descriptive**: Include character appearance, behavior, and setting
2. **Use Action Words**: "explaining", "presenting", "demonstrating"
3. **Specify Context**: Professional, casual, educational, etc.

### Audio Guidelines
1. **Clear Speech**: Use high-quality audio with minimal background noise
2. **Appropriate Length**: 5-30 seconds for best results
3. **Natural Pace**: Avoid speech that is too fast or too slow
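
To check the recommended 5-30 second range programmatically, here is a standard-library sketch for WAV input (other formats would need a decoder such as pydub):

```python
import wave

def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

duration = wav_duration_seconds("input.wav")  # placeholder file name
if not 5 <= duration <= 30:
    print(f"Warning: {duration:.1f}s is outside the recommended 5-30s range")
```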

### Performance Tips
1. **Start Small**: Use fewer steps (20-25) for testing
2. **Monitor VRAM**: Check GPU memory usage during generation
3. **Batch Processing**: Queue multiple samples in one session (see the Batch Processing example below)

## 📊 Model Information

### Architecture Overview
- **Base Model**: Wan2.1-T2V-14B (28GB) - Text-to-video generation
- **Avatar Weights**: OmniAvatar-14B (2GB) - LoRA adaptation for avatar animation
- **Audio Encoder**: wav2vec2-base-960h (360MB) - Speech feature extraction

### Capabilities
- **Resolution**: 480p (higher resolutions planned)
- **Duration**: Up to 30 seconds per generation
- **Audio Formats**: WAV, MP3, M4A, OGG
- **Image Formats**: JPG, PNG, WebP

## 🔧 Troubleshooting

### Common Issues

**"Models not found" Error**:
- Solution: Run the setup script to download required models
- Check: Ensure `pretrained_models/` directory contains all three model folders
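
A quick sanity check for the three folders, using the paths from the manual download step:

```python
from pathlib import Path

REQUIRED = [
    "pretrained_models/Wan2.1-T2V-14B",
    "pretrained_models/OmniAvatar-14B",
    "pretrained_models/wav2vec2-base-960h",
]

missing = [p for p in REQUIRED if not Path(p).is_dir()]
print("Missing model folders:", missing if missing else "none")
```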

**CUDA Out of Memory**:
- Solution: Reduce `max_tokens` or `num_steps` in configuration
- Alternative: Enable FSDP mode for memory efficiency

**Slow Generation**:
- Check: GPU utilization and VRAM usage
- Optimize: Use TeaCache with appropriate threshold (0.05-0.15)
- Consider: Multi-GPU setup for faster processing

**Audio Sync Issues**:
- Increase: `audio_scale` parameter (3.0-5.0)
- Check: Audio quality and clarity
- Ensure: Proper audio file format

### Performance Monitoring

```bash
# Check GPU usage
nvidia-smi

# Monitor generation progress
tail -f logs/generation.log

# Test system capabilities
python -c "from omniavatar_engine import omni_engine; print(omni_engine.get_model_info())"
```

## 🔗 Integration Examples

### Custom TTS Integration

```python
from omniavatar_engine import omni_engine

# Generate with custom audio
video_path, time_taken = omni_engine.generate_video(
    prompt="A friendly teacher explaining AI concepts",
    audio_path="path/to/your/audio.wav",
    image_path="path/to/reference/image.jpg",  # Optional
    guidance_scale=5.0,
    audio_scale=3.5,
    num_steps=30
)

print(f"Generated video: {video_path} in {time_taken:.1f}s")
```

### Batch Processing

```python
from omniavatar_engine import omni_engine

def batch_generate(prompts_and_audio):
    """Generate one video per (prompt, audio_path) pair, skipping failures."""
    results = []
    for prompt, audio_path in prompts_and_audio:
        try:
            video_path, time_taken = omni_engine.generate_video(
                prompt=prompt,
                audio_path=audio_path
            )
            results.append((video_path, time_taken))
        except Exception as e:
            print(f"Failed to generate for {prompt!r}: {e}")
    return results
```
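
Example invocation (the prompts and file names below are placeholders):

```python
jobs = [
    ("A news anchor - confident delivery - studio background", "audio/clip1.wav"),
    ("A casual presenter - relaxed speaking style - home office", "audio/clip2.wav"),
]

for video_path, seconds in batch_generate(jobs):
    print(f"{video_path} ({seconds:.1f}s)")
```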

## 📚 References

- **OmniAvatar Paper**: [arXiv:2506.18866](https://arxiv.org/abs/2506.18866)
- **Official Repository**: [GitHub - Omni-Avatar/OmniAvatar](https://github.com/Omni-Avatar/OmniAvatar)
- **HuggingFace Model**: [OmniAvatar/OmniAvatar-14B](https://huggingface.co/OmniAvatar/OmniAvatar-14B)
- **Base Model**: [Wan-AI/Wan2.1-T2V-14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)

## 🤝 Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## 📄 License

This project is licensed under Apache 2.0. See [LICENSE](LICENSE) for details.

## 🙋 Support

For questions and support:
- 📧 Email: ganqijun@zju.edu.cn (OmniAvatar authors)
- 💬 Issues: [GitHub Issues](https://github.com/Omni-Avatar/OmniAvatar/issues)
- 📖 Documentation: [Official Docs](https://github.com/Omni-Avatar/OmniAvatar)

---

**Citation**:
```bibtex
@misc{gan2025omniavatar,
  title={OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation},
  author={Qijun Gan and Ruizi Yang and Jianke Zhu and Shaofei Xue and Steven Hoi},
  year={2025},
  eprint={2506.18866},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```