---
license: mit
language:
  - en
datasets:
  - speechbrain/LoquaciousSet
base_model:
  - openai/whisper-large-v3-turbo
  - HuggingFaceTB/SmolLM3-3B
pipeline_tag: automatic-speech-recognition
tags:
  - asr
  - speech-recognition
  - audio
  - smollm
  - whisper
  - mlp
---

# Tiny Audio

A speech recognition model trained in ~24 hours on a single GPU for about $12. Built with the Tiny Audio codebase, a minimal, hackable framework for training ASR models.

## Architecture

```
Audio (16 kHz) → Whisper Encoder (frozen) → MLP Projector (trained) → SmolLM3-3B (frozen) → Text
```

**MLP Projector** (sketched below):

- Convolutional downsampling: 4x sequence compression via two stride-2 conv layers
- Linear (1280 → 2048) → GELU → Linear (2048 → 2048)
- Output normalization: RMSNorm
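A minimal PyTorch sketch of that projector, for orientation only: the dimensions and the two stride-2 convolutions come from the list above, while kernel sizes, layer names, and the activation between the convs are assumptions. Requires PyTorch ≥ 2.4 for `nn.RMSNorm`.

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Sketch: 4x temporal downsampling, then MLP, then RMSNorm.

    Only the shapes (1280 -> 2048) and the two stride-2 convs are from
    the model card; kernel size and intermediate GELUs are assumptions.
    """

    def __init__(self, enc_dim: int = 1280, llm_dim: int = 2048):
        super().__init__()
        # Two stride-2 convs give the 4x sequence compression.
        self.downsample = nn.Sequential(
            nn.Conv1d(enc_dim, enc_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv1d(enc_dim, enc_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )
        # Linear (1280 -> 2048) -> GELU -> Linear (2048 -> 2048)
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.norm = nn.RMSNorm(llm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, enc_dim) from the Whisper encoder
        x = self.downsample(x.transpose(1, 2)).transpose(1, 2)  # (batch, time/4, enc_dim)
        return self.norm(self.mlp(x))  # (batch, time/4, llm_dim)
```

The projector's output lands in SmolLM3-3B's embedding space, so only these layers need gradients; both the encoder and the LLM stay frozen.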

## Training Details

| Detail | Value |
| --- | --- |
| Dataset | LoquaciousSet (25,000 hours) |
| Hardware | Single NVIDIA A40 (40 GB) |
| Training time | ~24 hours |
| Cost | ~$12 |
| Trainable parameters | ~12M (projector only) |

## Performance

Word Error Rate (WER): **12.14%** on the LoquaciousSet test set.
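For reference, WER counts substitutions, deletions, and insertions relative to the number of reference words. A quick way to score your own transcripts is the `jiwer` package (an external tool, not part of this repo):

```python
# pip install jiwer
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / reference words
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # 2 subs / 9 words = 22.22%
```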

## Usage

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
)

result = pipe("path/to/audio.wav")
print(result["text"])
```
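The pipeline also accepts a pre-loaded waveform instead of a file path. A minimal sketch using `soundfile` (any audio loader works; mono float32 samples are assumed):

```python
import soundfile as sf

# Load mono float32 samples; per the limitations below, sample rates
# other than 16 kHz are resampled automatically.
audio, sr = sf.read("path/to/audio.wav", dtype="float32")
result = pipe({"raw": audio, "sampling_rate": sr})
print(result["text"])
```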

## Limitations

- English only
- Optimized for 16 kHz audio; other sample rates are resampled automatically
- Performance may degrade on heavily accented speech, noisy environments, or domain-specific jargon
- Maximum audio length limited by context window (see the sketch below)
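For audio longer than the context window, stock `transformers` ASR pipelines accept a `chunk_length_s` argument that splits the input into overlapping windows. Whether this custom pipeline honors it is untested, so treat this as an assumption to verify:

```python
# chunk_length_s is standard transformers pipeline API;
# support in this trust_remote_code pipeline is unverified.
result = pipe("path/to/long_audio.wav", chunk_length_s=30)
print(result["text"])
```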

## Learn More