# OmniAvatar-14B Integration - Avatar Video Generation with Adaptive Body Animation

This project integrates the powerful [OmniAvatar-14B model](https://huggingface.co/OmniAvatar/OmniAvatar-14B) to provide audio-driven avatar video generation with adaptive body animation.

## 🌟 Features

### Core Capabilities
- **Audio-Driven Animation**: Generate realistic avatar videos synchronized with speech
- **Adaptive Body Animation**: Dynamic body movements that adapt to speech content
- **Multi-Modal Input Support**: Text prompts, audio files, and reference images
- **Advanced TTS Integration**: Multiple text-to-speech systems with fallback
- **Web Interface**: Both Gradio UI and FastAPI endpoints
- **Performance Optimization**: TeaCache acceleration and multi-GPU support

### Technical Features
- ✅ **480p Video Generation** with 25 fps output
- ✅ **Lip-Sync Accuracy** with audio-visual alignment
- ✅ **Reference Image Support** for character consistency
- ✅ **Prompt-Controlled Behavior** for specific actions and expressions
- ✅ **Memory Efficient** with FSDP and gradient checkpointing
- ✅ **Scalable** from single GPU to multi-GPU setups

## 🚀 Quick Start

### 1. Set Up the Environment

```powershell
# Clone and navigate to the project
cd AI_Avatar_Chat

# Install dependencies
pip install -r requirements.txt
```

### 2. Download the OmniAvatar Models

**Option A: PowerShell script (Windows)**

```powershell
# Run the automated setup script
.\setup_omniavatar.ps1
```

**Option B: Python script (cross-platform)**

```bash
# Run the Python setup script
python setup_omniavatar.py
```

**Option C: Manual download**

```bash
# Install the HuggingFace CLI
pip install "huggingface_hub[cli]"

# Create the model directory
mkdir -p pretrained_models

# Download the models (~30 GB total)
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./pretrained_models/Wan2.1-T2V-14B
huggingface-cli download OmniAvatar/OmniAvatar-14B --local-dir ./pretrained_models/OmniAvatar-14B
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./pretrained_models/wav2vec2-base-960h
```

### 3. Run the Application

```bash
# Start the application
python app.py

# Access the web interface
# Gradio UI: http://localhost:7860/gradio
# API docs: http://localhost:7860/docs
```
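Before launching, it can be worth confirming that all three model folders from step 2 actually landed in `pretrained_models/`. The snippet below is a minimal sketch of such a check, assuming the directory names used by the download commands above; it is not part of the project's API.

```python
from pathlib import Path

# Expected layout, matching the --local-dir targets used in step 2
REQUIRED_MODELS = [
    "Wan2.1-T2V-14B",       # base text-to-video model (~28 GB)
    "OmniAvatar-14B",       # avatar LoRA weights (~2 GB)
    "wav2vec2-base-960h",   # audio encoder (~360 MB)
]

models_dir = Path("pretrained_models")
missing = [name for name in REQUIRED_MODELS if not (models_dir / name).is_dir()]

if missing:
    print(f"Missing model folders: {', '.join(missing)} - re-run the setup script.")
else:
    print("All OmniAvatar model folders are in place.")
```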
## 📖 Usage Guide

### Gradio Web Interface

1. **Enter Character Description**: Describe the avatar's appearance and behavior
2. **Provide Audio Input**: Choose from:
   - **Text-to-Speech**: Enter text to be spoken (recommended for beginners)
   - **Audio URL**: Direct link to an audio file
3. **Optional Reference Image**: URL to a reference photo for character consistency
4. **Adjust Parameters**:
   - **Guidance Scale**: 4-6 recommended (controls prompt adherence)
   - **Audio Scale**: 3-5 recommended (controls lip-sync accuracy)
   - **Steps**: 20-50 recommended (quality vs. speed trade-off)
5. **Generate**: Click to create your avatar video!

### API Usage

```python
import requests

# Generate an avatar video
response = requests.post("http://localhost:7860/generate", json={
    "prompt": "A professional teacher explaining concepts with clear gestures",
    "text_to_speech": "Hello students, today we'll learn about artificial intelligence.",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "guidance_scale": 5.0,
    "audio_scale": 3.5,
    "num_steps": 30
})

result = response.json()
print(f"Video URL: {result['output_path']}")
```

### Input Formats

**Prompt Structure** (based on the OmniAvatar paper's recommendations):

```
[Character Description] - [Behavior Description] - [Background Description (optional)]
```

**Examples:**
- `"A friendly teacher explaining concepts - enthusiastic hand gestures - modern classroom"`
- `"Professional news anchor - confident delivery - news studio background"`
- `"Casual presenter - relaxed speaking style - home office setting"`

## ⚙️ Configuration

### Performance Optimization

Based on your hardware, the system automatically optimizes its settings:

**High-end GPU (32GB+ VRAM)**:
- Full quality: 60000 tokens, unlimited parameters
- Speed: ~16s per iteration

**Medium GPU (16-32GB VRAM)**:
- Balanced: 30000 tokens, 7B parameter limit
- Speed: ~19s per iteration

**Low-end GPU (8-16GB VRAM)**:
- Memory efficient: 15000 tokens, minimal parameters
- Speed: ~22s per iteration

**Multi-GPU Setup (4+ GPUs)**:
- Optimal performance: sequence-parallel processing
- Speed: ~4.8s per iteration

### Advanced Settings

Edit `configs/inference.yaml` for fine-tuning:

```yaml
inference:
  max_tokens: 30000           # Context length
  guidance_scale: 4.5         # Prompt adherence
  audio_scale: 3.0            # Lip-sync strength
  num_steps: 25               # Quality iterations
  overlap_frame: 13           # Temporal consistency
  tea_cache_l1_thresh: 0.14   # Memory optimization

generation:
  resolution: "480p"          # Output resolution
  frame_rate: 25              # Video frame rate
  duration_seconds: 10        # Max video length
```

## 🎯 Best Practices

### Prompt Engineering

1. **Be Descriptive**: Include character appearance, behavior, and setting
2. **Use Action Words**: "explaining", "presenting", "demonstrating"
3. **Specify Context**: Professional, casual, educational, etc.

### Audio Guidelines

1. **Clear Speech**: Use high-quality audio with minimal background noise
2. **Appropriate Length**: 5-30 seconds for best results
3. **Natural Pace**: Avoid speech that is too fast or too slow

### Performance Tips

1. **Start Small**: Use fewer steps (20-25) for testing
2. **Monitor VRAM**: Check GPU memory usage during generation (a minimal check is sketched after this list)
3. **Batch Processing**: Process multiple samples efficiently (see the Batch Processing example under Integration Examples)
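As a rough sketch of the VRAM check mentioned in tip 2, PyTorch's built-in CUDA counters can be polled from Python instead of (or alongside) `nvidia-smi`. This assumes PyTorch with CUDA support is already installed, which the model itself requires; the helper name is just for illustration.

```python
import torch

def report_vram(tag: str = "") -> None:
    """Print current and peak GPU memory usage in GiB."""
    if not torch.cuda.is_available():
        print("CUDA is not available - nothing to report.")
        return
    gib = 1024 ** 3
    total = torch.cuda.get_device_properties(0).total_memory / gib
    current = torch.cuda.memory_allocated() / gib
    peak = torch.cuda.max_memory_allocated() / gib
    print(f"[VRAM {tag}] current {current:.1f} GiB / peak {peak:.1f} GiB / total {total:.1f} GiB")

# Example: call before and after a generation to see the peak footprint
report_vram("before generation")
# ... run the generation here ...
report_vram("after generation")
```

If the peak approaches the total, lowering `max_tokens` or `num_steps` in `configs/inference.yaml` (see Advanced Settings above) is the first thing to try.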
## 📊 Model Information

### Architecture Overview

- **Base Model**: Wan2.1-T2V-14B (28GB) - text-to-video generation
- **Avatar Weights**: OmniAvatar-14B (2GB) - LoRA adaptation for avatar animation
- **Audio Encoder**: wav2vec2-base-960h (360MB) - speech feature extraction

### Capabilities

- **Resolution**: 480p (higher resolutions planned)
- **Duration**: Up to 30 seconds per generation
- **Audio Formats**: WAV, MP3, M4A, OGG
- **Image Formats**: JPG, PNG, WebP

## 🔧 Troubleshooting

### Common Issues

**"Models not found" error**:
- Solution: Run the setup script to download the required models
- Check: Ensure the `pretrained_models/` directory contains all three model folders

**CUDA out of memory**:
- Solution: Reduce `max_tokens` or `num_steps` in the configuration
- Alternative: Enable FSDP mode for memory efficiency

**Slow generation**:
- Check: GPU utilization and VRAM usage
- Optimize: Use TeaCache with an appropriate threshold (0.05-0.15)
- Consider: A multi-GPU setup for faster processing

**Audio sync issues**:
- Increase: The `audio_scale` parameter (3.0-5.0)
- Check: Audio quality and clarity
- Ensure: A supported audio file format

### Performance Monitoring

```bash
# Check GPU usage
nvidia-smi

# Monitor generation progress
tail -f logs/generation.log

# Test system capabilities
python -c "from omniavatar_engine import omni_engine; print(omni_engine.get_model_info())"
```

## 🔗 Integration Examples

### Custom TTS Integration

```python
from omniavatar_engine import omni_engine

# Generate with custom audio
video_path, time_taken = omni_engine.generate_video(
    prompt="A friendly teacher explaining AI concepts",
    audio_path="path/to/your/audio.wav",
    image_path="path/to/reference/image.jpg",  # Optional
    guidance_scale=5.0,
    audio_scale=3.5,
    num_steps=30
)

print(f"Generated video: {video_path} in {time_taken:.1f}s")
```

### Batch Processing

```python
import asyncio

from omniavatar_engine import omni_engine

async def batch_generate(prompts_and_audio):
    """Generate one video per (prompt, audio_path) pair, skipping failures."""
    results = []
    for prompt, audio_path in prompts_and_audio:
        try:
            # Run the blocking generation call in a worker thread so the
            # event loop stays responsive
            video_path, time_taken = await asyncio.to_thread(
                omni_engine.generate_video,
                prompt=prompt,
                audio_path=audio_path,
            )
            results.append((video_path, time_taken))
        except Exception as e:
            print(f"Failed to generate for {prompt}: {e}")
    return results
```

## 📚 References

- **OmniAvatar Paper**: [arXiv:2506.18866](https://arxiv.org/abs/2506.18866)
- **Official Repository**: [GitHub - Omni-Avatar/OmniAvatar](https://github.com/Omni-Avatar/OmniAvatar)
- **HuggingFace Model**: [OmniAvatar/OmniAvatar-14B](https://huggingface.co/OmniAvatar/OmniAvatar-14B)
- **Base Model**: [Wan-AI/Wan2.1-T2V-14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)

## 🤝 Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## 📄 License

This project is licensed under Apache 2.0. See [LICENSE](LICENSE) for details.

## 🙋 Support

For questions and support:
- 📧 Email: ganqijun@zju.edu.cn (OmniAvatar authors)
- 💬 Issues: [GitHub Issues](https://github.com/Omni-Avatar/OmniAvatar/issues)
- 📖 Documentation: [Official Docs](https://github.com/Omni-Avatar/OmniAvatar)

---

**Citation**:

```bibtex
@misc{gan2025omniavatar,
    title={OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation},
    author={Qijun Gan and Ruizi Yang and Jianke Zhu and Shaofei Xue and Steven Hoi},
    year={2025},
    eprint={2506.18866},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```