
title: asr-multi-model
emoji: 😊
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false

🎤 Multi-Model ASR Speech Recognition

A comprehensive Automatic Speech Recognition (ASR) application with support for multiple models and Word Error Rate (WER) analysis.

✨ Features

🤖 Multiple Models: Support for Wav2Vec2 and Whisper models
🎤 Audio Recording: Direct microphone recording
📁 File Upload: Support for various audio formats
📊 WER Analysis: Calculate Word Error Rate with detailed breakdown
💾 Memory Efficient: Dynamic model loading and cleanup
🌍 Multilingual: Whisper models support multiple languages

🚀 Live Demo

This application is hosted on Hugging Face Spaces. You can access it at: [Your HF Spaces URL]

🤖 Available Models

| Model | Type | Speed | Accuracy | Memory | Best For |
|---|---|---|---|---|---|
| Wav2Vec2 Base (100h) | Wav2Vec2 | ⚡ Fast | 🟡 Good | ~300 MB | Basic tasks |
| Wav2Vec2 Base (960h) | Wav2Vec2 | 🟡 Balanced | 🟢 Better | ~1 GB | General use |
| Wav2Vec2 Large (960h) | Wav2Vec2 | 🐌 Slower | 🔴 High | ~3 GB | Difficult audio |
| Whisper Large V3 Turbo | Whisper | 🐌 Slower | 🔴 Best | ~5 GB | Multilingual |

📖 How to Use

Select Model: Choose from available Wav2Vec2 and Whisper models
Load Model: Click 'Load Model' to load the selected model
Record/Upload: Record audio or upload an audio file
Transcribe: Click 'Transcribe' or wait for auto-transcription
WER Analysis: Enter reference text to calculate Word Error Rate
Copy Text: Use 'Copy Text' to copy the result

🔧 Technical Details

Models Used

Wav2Vec2: Facebook's speech models, pretrained with self-supervision on unlabeled audio and then fine-tuned for speech recognition
Whisper: OpenAI's multilingual speech recognition model
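
The "dynamic model loading and cleanup" behaviour listed under Features could look roughly like the sketch below. It is not taken from app.py: the `ModelManager` class, the `default_loader` helper, and the mapping from display names to Hub checkpoint IDs are all assumptions made for illustration. The loader is injectable so the pattern can be tried without downloading any weights.

```python
import gc
from typing import Any, Callable, Dict, Optional

# Hypothetical mapping from the display names in "Available Models" to
# Hugging Face Hub checkpoint IDs. The IDs actually used by app.py may differ.
MODEL_IDS: Dict[str, str] = {
    "Wav2Vec2 Base (100h)": "facebook/wav2vec2-base-100h",
    "Wav2Vec2 Base (960h)": "facebook/wav2vec2-base-960h",
    "Wav2Vec2 Large (960h)": "facebook/wav2vec2-large-960h",
    "Whisper Large V3 Turbo": "openai/whisper-large-v3-turbo",
}


def default_loader(model_id: str) -> Any:
    """Build a transformers ASR pipeline (lazy import; downloads weights)."""
    from transformers import pipeline

    return pipeline("automatic-speech-recognition", model=model_id)


class ModelManager:
    """Keep at most one model in memory, freeing the old one before a new load."""

    def __init__(self, loader: Callable[[str], Any] = default_loader) -> None:
        self._loader = loader
        self.current_name: Optional[str] = None
        self.model: Any = None

    def unload(self) -> None:
        """Drop the reference to the current model and reclaim memory."""
        self.model = None
        self.current_name = None
        gc.collect()
        try:
            import torch

            if torch.cuda.is_available():
                torch.cuda.empty_cache()  # release cached GPU blocks
        except ImportError:
            pass  # torch not installed: nothing GPU-side to clean up

    def load(self, name: str) -> Any:
        """Load the model called `name`, reusing it if it is already loaded."""
        if name == self.current_name:
            return self.model
        if name not in MODEL_IDS:
            raise KeyError(f"unknown model: {name!r}")
        self.unload()
        self.model = self._loader(MODEL_IDS[name])
        self.current_name = name
        return self.model
```

With the default loader, `ModelManager().load("Wav2Vec2 Base (960h)")` would return a pipeline that can be called on an audio file path. Unloading before loading keeps peak memory close to the size of a single model, which matters when switching between the larger entries.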

Audio Processing

Automatic resampling to 16 kHz
Mono conversion for stereo audio
Audio normalization
Support for various formats (MP3, WAV, M4A, FLAC)
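
The first three steps above can be sketched with NumPy alone. This is an illustration only: the function names are hypothetical, the input is assumed to be shaped `(samples,)` or `(samples, channels)`, and the linear-interpolation resampler is a simple stand-in with no anti-aliasing filter. The app itself most likely relies on librosa or torchaudio (both are in the requirements), which resample with proper filtering.

```python
import numpy as np

TARGET_SR = 16_000  # target sample rate from the list above


def to_mono(audio: np.ndarray) -> np.ndarray:
    """Average the channels; expects shape (samples,) or (samples, channels)."""
    audio = np.asarray(audio, dtype=np.float32)
    return audio if audio.ndim == 1 else audio.mean(axis=1)


def resample_linear(audio: np.ndarray, orig_sr: int,
                    target_sr: int = TARGET_SR) -> np.ndarray:
    """Resample by linear interpolation (stand-in, no anti-aliasing)."""
    if orig_sr == target_sr or audio.size == 0:
        return audio
    n_out = int(round(audio.size * target_sr / orig_sr))
    old_t = np.arange(audio.size) / orig_sr   # timestamps of input samples
    new_t = np.arange(n_out) / target_sr      # timestamps of output samples
    return np.interp(new_t, old_t, audio).astype(np.float32)


def peak_normalize(audio: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale so the largest absolute sample is 1.0; leave silence untouched."""
    peak = float(np.max(np.abs(audio))) if audio.size else 0.0
    return audio if peak < eps else audio / np.float32(peak)


def preprocess(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Mono conversion, then resampling to 16 kHz, then peak normalization."""
    return peak_normalize(resample_linear(to_mono(audio), orig_sr))
```

Converting to mono before resampling means only one channel has to be interpolated. Peak normalization is one of several reasonable choices; the source does not say which normalization the app applies.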

WER Calculation

Aligns the reference and transcribed words using word-level edit distance (Levenshtein)
Normalizes both texts before comparison (lowercased, punctuation removed)
Provides a detailed breakdown of insertions, deletions, and substitutions
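
A minimal pure-Python sketch of this calculation follows. The app depends on the `editdistance` package, so this is not its implementation: `normalize` and `wer_breakdown` are hypothetical names, and keeping apostrophes during normalization is an assumption. The sketch fills a word-level edit-distance table, then walks it backwards to count each error type.

```python
import re
from typing import Dict, List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation (apostrophes kept), split into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def wer_breakdown(reference: str, hypothesis: str) -> Dict[str, float]:
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    n, m = len(ref), len(hyp)

    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion (word missing from hyp)
                dist[i][j - 1] + 1,         # insertion (extra word in hyp)
            )

    # Walk back from the bottom-right corner to count each error type.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                subs += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1

    return {
        "wer": (subs + dels + ins) / n,
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
        "correct": n - subs - dels,
        "reference_words": n,
    }
```

For example, `wer_breakdown("the cat sat on the mat", "the cat sat on mat")` reports one deletion out of six reference words. When several alignments share the same minimum distance, the split between error types depends on the tie-breaking order in the backtrace, but the total error count and the WER do not.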

🛠️ Local Development

Prerequisites

Python 3.9+ (the torch>=2.6.0 requirement listed below does not support Python 3.8)
CUDA-compatible GPU (optional, for faster inference)
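
Because the GPU is optional, code that runs the models usually picks a device at startup and falls back to the CPU. A small sketch, with `pick_device` as a hypothetical helper name:

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch missing: nothing can run on a GPU anyway
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Running inference on: {pick_device()}")
```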

Installation

git clone [your-repo-url]
cd [your-repo-name]
pip install -r requirements.txt
python app.py

Requirements

gradio>=4.44.0
torch>=2.6.0
torchaudio>=2.6.0
transformers>=4.36.2
librosa>=0.10.1
soundfile>=0.12.1
numpy>=1.24.3
editdistance>=1.0.11

📊 WER Analysis

The application provides detailed Word Error Rate analysis:

Word Error Rate: (substitutions + deletions + insertions) divided by the number of reference words, shown as a percentage
Error Breakdown: Insertions, deletions, substitutions
Word Statistics: Correct words, total words, accuracy
Normalized Texts: Shows processed texts for verification
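
The figures in this report follow from four counts. The sketch below shows the arithmetic under two assumptions that the source does not confirm: "total words" means the number of words in the reference, and "accuracy" means the share of reference words transcribed correctly. `summarize` is a hypothetical helper name.

```python
from typing import Dict


def summarize(correct: int, substitutions: int, deletions: int,
              insertions: int) -> Dict[str, float]:
    """Derive the report figures from alignment counts."""
    # Every reference word is either correct, substituted, or deleted.
    total = correct + substitutions + deletions
    if total == 0:
        raise ValueError("reference must contain at least one word")
    errors = substitutions + deletions + insertions
    return {
        "total_words": total,
        "errors": errors,
        "wer_percent": 100.0 * errors / total,
        "accuracy_percent": 100.0 * correct / total,
    }


if __name__ == "__main__":
    # 10 reference words: 8 correct, 1 substituted, 1 deleted, plus 1 insertion.
    report = summarize(correct=8, substitutions=1, deletions=1, insertions=1)
    print(f"WER: {report['wer_percent']:.1f}%")            # 3 errors / 10 words
    print(f"Accuracy: {report['accuracy_percent']:.1f}%")  # 8 correct / 10 words
```

Insertions count as errors but add no reference words, so WER can exceed 100% when the transcript contains many extra words, and WER and accuracy need not add up to 100%.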

🎯 Performance Tips

Clear Speech: Speak clearly for better accuracy
Quiet Environment: Minimize background noise
Good Microphone: Use quality audio input
Model Selection: Choose based on your needs (speed vs accuracy)

🤝 Contributing

Feel free to submit issues, feature requests, or pull requests to improve this application.

📝 License

This project is open source and available under the MIT License.

🙏 Acknowledgments

Hugging Face for the transformers library
Facebook for Wav2Vec2 models
OpenAI for Whisper models
Gradio for the web interface framework