🎵 Multilingual Audio Intelligence System
New Features ✨
Demo Mode with Professional Audio Files
- Yuri Kizaki - Japanese Audio: Professional voice message about website communication (23 seconds)
- French Film Podcast: Discussion about movies including The Social Network and Paranormal Activity (25 seconds)
- Smart demo file management with automatic download and preprocessing
- Near-instant demo results served from cached processing output
Enhanced User Interface
- Audio Waveform Visualization: Real-time waveform display with HTML5 Canvas
- Interactive Demo Selection: Beautiful cards for selecting demo audio files
- Improved Transcript Display: Color-coded confidence levels and clear translation sections
- Professional Audio Preview: Audio player with waveform visualization
Technical Improvements
- Automatic demo file download from original sources
- Cached preprocessing results for instant demo response
- Enhanced error handling for missing or corrupted demo files
- Web Audio API integration for dynamic waveform generation
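The cached-preprocessing idea above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the `demo_cache` directory, the `load_or_process` helper, and the `process` callable are all hypothetical names.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_key(audio_path: Path) -> str:
    """Derive a key from the file contents, so a changed file misses the cache."""
    return hashlib.sha256(audio_path.read_bytes()).hexdigest()[:16]


def load_or_process(
    audio_path: Path,
    process: Callable[[Path], dict],
    cache_dir: Path = Path("demo_cache"),  # hypothetical location
) -> dict:
    """Return cached results when present; otherwise run `process` and store its output."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{cache_key(audio_path)}.json"
    if cache_file.exists():
        try:
            return json.loads(cache_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            pass  # corrupted cache entry: fall through and recompute
    result = process(audio_path)
    cache_file.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    return result
```

Keying on file contents instead of the filename means a re-downloaded or replaced demo file is reprocessed automatically.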
Quick Start
# Install dependencies
pip install -r requirements.txt
# Start the application (includes demo file setup)
python run_fastapi.py
# Access the application
# http://127.0.0.1:8000
Demo Mode Usage
- Select Demo Mode: Click the "Demo Mode" button in the header
- Choose Audio File: Select either Japanese or French demo audio
- Preview: Listen to the audio and view the waveform
- Process: Click "Process Audio" for instant results
- Explore: View transcripts, translations, and analytics
Full Processing Mode
- Upload Audio: Drag & drop or click to upload your audio file
- Preview: View waveform and listen to your audio
- Configure: Select model size and target language
- Process: Real-time processing with progress tracking
- Download: Export results in JSON, SRT, or TXT format
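For readers unfamiliar with the SRT export mentioned above, here is a minimal sketch of how transcript segments map onto SRT cues. The segment keys (`start`, `end`, `text`) are an assumption for illustration; the project's real output schema may differ.

```python
from typing import Iterable


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[dict]) -> str:
    """Render segments as SRT cues: index, time range, text, blank separator line."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)
```

For example, `srt_timestamp(3661.5)` returns `01:01:01,500`.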
Features
System Architecture
Core Components
- FastAPI Backend - Production-ready web framework
- HTML/TailwindCSS Frontend - Clean, professional interface
- Audio Processing Pipeline - Integrated ML models
- RESTful API - Standardized endpoints
Key Features
- Speaker Diarization - Identify "who spoke when"
- Speech Recognition - Convert speech to text
- Language Detection - Automatic language identification
- Neural Translation - Multi-language translation
- Interactive Visualization - Waveform analysis
- Multiple Export Formats - JSON, SRT, TXT
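One common way to combine diarization ("who spoke when") with speech recognition is to give each transcript segment the speaker whose turn overlaps it the most. The sketch below shows that idea only; this README does not say how the project merges the two, and `Turn` and `assign_speaker` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Turn:
    """A diarization result: one speaker talking from `start` to `end` (seconds)."""
    speaker: str
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(
    seg_start: float, seg_end: float, turns: Iterable[Turn], default: str = "UNKNOWN"
) -> str:
    """Pick the speaker whose turn shares the most time with the transcript segment."""
    best_speaker, best_overlap = default, 0.0
    for turn in turns:
        shared = overlap(seg_start, seg_end, turn.start, turn.end)
        if shared > best_overlap:
            best_speaker, best_overlap = turn.speaker, shared
    return best_speaker
```

A segment that overlaps no turn at all keeps the `default` label, which makes gaps in the diarization visible in the transcript.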
Technology Stack
Backend
- FastAPI - Modern Python web framework
- Uvicorn - ASGI server
- PyTorch - Deep learning framework
- pyannote.audio - Speaker diarization
- faster-whisper - Speech recognition
- Helsinki-NLP - Neural translation
Frontend
- HTML5/CSS3 - Clean markup
- TailwindCSS - Utility-first styling
- JavaScript (Vanilla) - Client-side logic
- Plotly.js - Interactive visualizations
- Font Awesome - Professional icons
API Endpoints
Core Endpoints
- GET / - Main application interface
- POST /api/upload - Upload and process audio
- GET /api/status/{task_id} - Check processing status
- GET /api/results/{task_id} - Retrieve results
- GET /api/download/{task_id}/{format} - Download outputs
Demo Endpoints
- POST /api/demo-process - Quick demo processing
- GET /api/system-info - System information
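A client would typically poll the status endpoint and then fetch the results once the task finishes. The sketch below uses only the standard library. The URL paths come from the endpoint list, but the response shape is an assumption: the `status` field and its `complete` and `error` values are hypothetical, so check `/api/docs` for the real schema.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://127.0.0.1:8000"  # address from Quick Start


def get_json(url: str) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def wait_for_results(
    task_id: str,
    fetch: Callable[[str], dict] = get_json,
    interval: float = 2.0,
    max_polls: int = 150,
) -> dict:
    """Poll /api/status/{task_id}, then return /api/results/{task_id} when done."""
    status_url = f"{BASE_URL}/api/status/{task_id}"
    for _ in range(max_polls):
        status = fetch(status_url)
        state = status.get("status")  # assumed field name
        if state == "complete":  # assumed value
            return fetch(f"{BASE_URL}/api/results/{task_id}")
        if state == "error":  # assumed value
            raise RuntimeError(f"Task {task_id} failed: {status}")
        time.sleep(interval)
    raise TimeoutError(f"Task {task_id} did not finish after {max_polls} polls")
```

The `fetch` parameter exists so the polling logic can be exercised without a running server.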
File Structure
audio_challenge/
├── web_app.py                # FastAPI application
├── run_fastapi.py            # Startup script
├── requirements.txt          # Dependencies
├── templates/
│   └── index.html            # Main interface
├── src/                      # Core modules
│   ├── main.py               # Pipeline orchestrator
│   ├── audio_processor.py    # Audio preprocessing
│   ├── speaker_diarizer.py   # Speaker identification
│   ├── speech_recognizer.py  # ASR with language detection
│   ├── translator.py         # Neural machine translation
│   ├── output_formatter.py   # Output generation
│   └── utils.py              # Utility functions
├── static/                   # Static assets
├── uploads/                  # Uploaded files
├── outputs/                  # Generated outputs
└── README.md
Configuration
Environment Variables
Create a .env file:
HUGGINGFACE_TOKEN=hf_your_token_here # Optional, for gated models
Model Configuration
- Whisper Model: tiny/small/medium/large
- Target Language: en/es/fr/de/it/pt/zh/ja/ko/ar
- Device: auto/cpu/cuda
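The options above could be validated before a job starts, along these lines. The allowed values are taken from the list. The `PipelineConfig` class, its defaults, and the way `auto` is resolved are assumptions made for illustration.

```python
from dataclasses import dataclass

WHISPER_MODELS = ("tiny", "small", "medium", "large")
TARGET_LANGUAGES = ("en", "es", "fr", "de", "it", "pt", "zh", "ja", "ko", "ar")
DEVICES = ("auto", "cpu", "cuda")


def _check(name: str, value: str, allowed: tuple) -> None:
    """Raise ValueError when `value` is not one of the allowed options."""
    if value not in allowed:
        raise ValueError(f"{name} must be one of {allowed}, got {value!r}")


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical settings container; the defaults are illustrative only."""
    whisper_model: str = "small"
    target_language: str = "en"
    device: str = "auto"

    def __post_init__(self) -> None:
        _check("whisper_model", self.whisper_model, WHISPER_MODELS)
        _check("target_language", self.target_language, TARGET_LANGUAGES)
        _check("device", self.device, DEVICES)


def resolve_device(device: str) -> str:
    """Turn 'auto' into 'cuda' when PyTorch reports a usable GPU, else 'cpu'."""
    if device != "auto":
        return device
    try:
        import torch  # optional here; fall back to CPU when it is not installed
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Rejecting bad values up front gives a clear error before any model is loaded.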
Supported Audio Formats
- WAV (recommended)
- MP3
- OGG
- FLAC
- M4A
Maximum file size: 100MB
Recommended duration: Under 30 minutes
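A pre-upload check for the limits above might look like the following sketch. The extension list and the 100MB cap come from this section. The `validate_upload` helper is hypothetical, and treating 1MB as 1024 * 1024 bytes is an assumption.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".ogg", ".flac", ".m4a"}
MAX_FILE_SIZE = 100 * 1024 * 1024  # 100MB, assuming 1MB = 1024 * 1024 bytes


def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError when the file type or size falls outside the documented limits."""
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {extension or '(none)'}")
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError(f"File is {size_bytes} bytes; the limit is {MAX_FILE_SIZE}")
```

This checks the file extension only. A renamed file can still fail later during decoding, which is where converting to WAV helps.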
Development
Local Development
python run_fastapi.py
Production Deployment
uvicorn web_app:app --host 0.0.0.0 --port 8000
Performance
- Processing Speed: 2-14x real-time (depending on model size)
- Memory Usage: Optimized with INT8 quantization
- CPU Optimized: Works without GPU
- Concurrent Processing: Async/await support
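To check the speed figures on your own hardware, you can time a run and divide the audio duration by the elapsed time. The sketch below reads "Nx real-time" as "N seconds of audio processed per second", which is an assumption. `load_int8_model` uses faster-whisper's `WhisperModel` with `compute_type="int8"`; whether the project loads its model exactly this way is not confirmed here.

```python
import time
from typing import Callable


def load_int8_model(model_size: str = "small"):
    """Load a faster-whisper model with INT8 weights on CPU."""
    from faster_whisper import WhisperModel  # lazy import; needs faster-whisper installed

    return WhisperModel(model_size, device="cpu", compute_type="int8")


def speed_factor(audio_seconds: float, run: Callable[[], object]) -> float:
    """Return audio duration divided by the wall-clock time `run` takes."""
    started = time.perf_counter()
    run()
    elapsed = max(time.perf_counter() - started, 1e-9)  # guard against a zero interval
    return audio_seconds / elapsed
```

Note that `WhisperModel.transcribe` returns its segments lazily, so the timed callable should consume them, for example `lambda: list(model.transcribe("audio.wav")[0])`, or the measurement will look far faster than it really is.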
Troubleshooting
Common Issues
- Dependencies: Use requirements.txt for clean installation
- Memory: Use smaller models (tiny/small) for limited hardware
- Audio Format: Convert to WAV if other formats fail
- Port Conflicts: Change port in run_fastapi.py if 8000 is occupied
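If you suspect a port conflict, a quick check like the one below tells you whether 8000 is taken and suggests a nearby free port. It is a standalone helper, not part of the project. You would still need to set the chosen port in run_fastapi.py yourself.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can bind to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start: int = 8000, attempts: int = 20) -> int:
    """Scan upward from `start` and return the first port that can be bound."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    raise RuntimeError(f"No free port found in {start}-{start + attempts - 1}")


if __name__ == "__main__":
    print("Port 8000 free:", port_is_free(8000))
    print("Suggested port:", first_free_port())
```

The result is only a snapshot: another process can claim the port between this check and the server start.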
Error Resolution
- Check logs in terminal output
- Verify audio file format and size
- Ensure all dependencies are installed
- Check available system memory
License
MIT License - See LICENSE file for details
Support
- Documentation: Check /api/docs endpoint
- System Info: Use the info button in the web interface
- Logs: Monitor terminal output for detailed information