metadata
title: asr-multi-model
emoji: π
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
Multi-Model ASR Speech Recognition
A comprehensive Automatic Speech Recognition (ASR) application with support for multiple models and Word Error Rate (WER) analysis.
Features
- Multiple Models: Support for Wav2Vec2 and Whisper models
- Audio Recording: Direct microphone recording
- File Upload: Support for various audio formats
- WER Analysis: Calculate Word Error Rate with detailed breakdown
- Memory Efficient: Dynamic model loading and cleanup
- Multilingual: Whisper models support multiple languages
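
The "dynamic model loading and cleanup" feature could work roughly as sketched below: keep at most one model resident, and release the previous one before loading the next. `ModelManager` and its loader callables are hypothetical names for illustration, not taken from `app.py`; the CUDA cache call only runs if PyTorch is installed.

```python
import gc
from typing import Any, Callable, Dict, Optional


class ModelManager:
    """Keeps at most one model in memory at a time (hypothetical helper)."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]]) -> None:
        # Maps a model name to a zero-argument function that builds the model.
        self._loaders = loaders
        self.current_name: Optional[str] = None
        self.current_model: Any = None

    def unload(self) -> None:
        """Drop the reference to the current model and reclaim memory."""
        self.current_model = None
        self.current_name = None
        gc.collect()
        try:
            import torch  # optional: only needed to release cached GPU memory
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass

    def load(self, name: str) -> Any:
        """Return the requested model, unloading any other model first."""
        if name == self.current_name:
            return self.current_model
        if name not in self._loaders:
            raise KeyError(f"Unknown model: {name}")
        self.unload()
        self.current_model = self._loaders[name]()
        self.current_name = name
        return self.current_model
```

Passing loaders as callables defers the expensive construction until the user actually selects a model, which is what keeps idle memory low.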
 
Live Demo
This application is hosted on Hugging Face Spaces. You can access it at: [Your HF Spaces URL]
Available Models
| Model | Type | Speed | Accuracy | Memory | Best For |
|---|---|---|---|---|---|
| Wav2Vec2 Base (100h) | Wav2Vec2 | Fast | Good | ~300MB | Basic tasks |
| Wav2Vec2 Base (960h) | Wav2Vec2 | Balanced | Better | ~1GB | General use |
| Wav2Vec2 Large (960h) | Wav2Vec2 | Slower | High | ~3GB | Difficult audio |
| Whisper Large V3 Turbo | Whisper | Slower | Best | ~5GB | Multilingual |
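
One way to make the table above usable in code is a small registry that can be filtered by memory budget. This is a sketch only: the Hub checkpoint ids are assumptions (the README does not say which checkpoints `app.py` loads), the memory values are the approximate figures from the table, and `ModelInfo` and `models_within_budget` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelInfo:
    label: str        # display name from the table
    family: str       # "wav2vec2" or "whisper"
    hub_id: str       # ASSUMED Hugging Face Hub checkpoint id
    memory_gb: float  # approximate memory figure from the table


MODELS: List[ModelInfo] = [
    ModelInfo("Wav2Vec2 Base (100h)", "wav2vec2", "facebook/wav2vec2-base-100h", 0.3),
    ModelInfo("Wav2Vec2 Base (960h)", "wav2vec2", "facebook/wav2vec2-base-960h", 1.0),
    ModelInfo("Wav2Vec2 Large (960h)", "wav2vec2", "facebook/wav2vec2-large-960h", 3.0),
    ModelInfo("Whisper Large V3 Turbo", "whisper", "openai/whisper-large-v3-turbo", 5.0),
]


def models_within_budget(budget_gb: float) -> List[ModelInfo]:
    """Return the models whose approximate memory figure fits the budget."""
    return [m for m in MODELS if m.memory_gb <= budget_gb]
```

For example, `models_within_budget(1.0)` keeps only the two Wav2Vec2 Base entries.
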
How to Use
- Select Model: Choose from available Wav2Vec2 and Whisper models
- Load Model: Click 'Load Model' to load the selected model
- Record/Upload: Record audio or upload an audio file
- Transcribe: Click 'Transcribe' or wait for auto-transcription
- WER Analysis: Enter reference text to calculate Word Error Rate
- Copy Text: Use 'Copy Text' to copy the result
 
Technical Details
Models Used
- Wav2Vec2: Facebook's self-supervised speech recognition models
- Whisper: OpenAI's multilingual speech recognition model
 
Audio Processing
- Automatic resampling to 16 kHz
- Mono conversion for stereo audio
- Audio normalization
- Support for various formats (MP3, WAV, M4A, FLAC)
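
The preprocessing steps above (mono conversion, resampling to 16 kHz, normalization) can be sketched with NumPy alone. This is an illustration under stated assumptions, not the app's code: it expects audio shaped `(samples,)` or `(samples, channels)`, uses peak normalization because the README does not say which kind is applied, and resamples by plain linear interpolation with no anti-aliasing filter. A real pipeline would more likely rely on the resamplers in `librosa` or `torchaudio`, both of which are listed in the requirements.

```python
import numpy as np

TARGET_SR = 16_000  # target sample rate stated in the list above


def to_mono(audio: np.ndarray) -> np.ndarray:
    """Average channels; expects shape (samples,) or (samples, channels)."""
    audio = np.asarray(audio, dtype=np.float32)
    return audio if audio.ndim == 1 else audio.mean(axis=1)


def resample_linear(audio: np.ndarray, orig_sr: int,
                    target_sr: int = TARGET_SR) -> np.ndarray:
    """Linear-interpolation resampling (simple stand-in, no anti-aliasing)."""
    if orig_sr == target_sr or audio.size == 0:
        return audio
    n_out = int(round(audio.shape[0] * target_sr / orig_sr))
    old_t = np.arange(audio.shape[0]) / orig_sr  # original sample times
    new_t = np.arange(n_out) / target_sr         # resampled sample times
    return np.interp(new_t, old_t, audio).astype(np.float32)


def peak_normalize(audio: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale so the largest absolute sample is 1.0; leave silence untouched."""
    peak = float(np.max(np.abs(audio))) if audio.size else 0.0
    return audio if peak < eps else (audio / peak).astype(np.float32)


def preprocess(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Mono -> 16 kHz -> peak-normalized float32 waveform."""
    return peak_normalize(resample_linear(to_mono(audio), orig_sr))
```

Converting to mono before resampling means only one channel has to be interpolated.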
 
WER Calculation
- Aligns reference and hypothesis words using edit distance
- Normalizes both texts before comparison (lowercase, punctuation removed)
- Provides a detailed breakdown of insertions, deletions, and substitutions
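
A minimal sketch of how such a calculation might look is shown below. It uses a pure-Python dynamic-programming table as a stand-in for the `editdistance` package listed in the requirements, and it strips ASCII punctuation only. The function names and the returned keys are hypothetical, not the app's actual API.

```python
import string
from typing import Dict, List


def normalize(text: str) -> List[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def wer_breakdown(reference: str, hypothesis: str) -> Dict[str, float]:
    """Word-level edit distance with a substitution/deletion/insertion count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    n, m = len(ref), len(hyp)
    if n == 0:
        raise ValueError("reference text is empty after normalization")

    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion

    # Walk back from the bottom-right corner to classify each edit.
    subs = dels = ins = hits = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            hits += 1
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1

    return {
        "wer": (subs + dels + ins) / n,
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
        "correct": hits,
        "reference_words": n,
    }
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% when the hypothesis is much longer than the reference.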
 
Local Development
Prerequisites
- Python 3.8+
- CUDA-compatible GPU (optional, for faster inference)
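
Since the GPU is optional, device selection can fall back to the CPU. A small sketch, assuming PyTorch is the inference backend (it is listed in the requirements); `pick_device` is a hypothetical helper name:

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet, so only CPU can be reported.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Running inference on: {pick_device()}")
```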
 
Installation
git clone [your-repo-url]
cd [your-repo-name]
pip install -r requirements.txt
python app.py
Requirements
gradio>=4.44.0
torch>=2.6.0
torchaudio>=2.6.0
transformers>=4.36.2
librosa>=0.10.1
soundfile>=0.12.1
numpy>=1.24.3
editdistance>=1.0.11
WER Analysis
The application provides detailed Word Error Rate analysis:
- Word Error Rate: Word errors as a percentage of the reference length
- Error Breakdown: Insertions, deletions, substitutions
- Word Statistics: Correct words, total words, accuracy
- Normalized Texts: Shows the processed reference and transcription for verification
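
These figures are related by the standard WER definition. With $S$ substitutions, $D$ deletions, $I$ insertions, $C$ correct words, and $N$ words in the reference, and assuming accuracy is reported as the share of reference words recognized correctly (the app may define it differently):

$$
\mathrm{WER} = \frac{S + D + I}{N}, \qquad \mathrm{Accuracy} = \frac{C}{N} = \frac{N - S - D}{N}
$$

For example, with the reference "the cat sat on the mat" ($N = 6$) and the transcription "the cat sit on mat", there is one substitution (sat → sit) and one deletion (the), so $\mathrm{WER} = (1 + 1 + 0)/6 \approx 33.3\%$ and, with $C = 4$, accuracy is $4/6 \approx 66.7\%$.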
 
Performance Tips
- Clear Speech: Speak clearly for better accuracy
- Quiet Environment: Minimize background noise
- Good Microphone: Use quality audio input
- Model Selection: Choose based on your needs (speed vs. accuracy)
 
Contributing
Feel free to submit issues, feature requests, or pull requests to improve this application.
License
This project is open source and available under the MIT License.
Acknowledgments
- Hugging Face for the transformers library
- Facebook for Wav2Vec2 models
- OpenAI for Whisper models
- Gradio for the web interface framework