# CED

State-of-the-art efficient audio classifiers trained on AudioSet.
CED models are simple ViT-Transformer-based audio taggers that achieve state-of-the-art performance on AudioSet.
| Model | Parameters (M) | AS-20K (mAP) | AS-2M (mAP) |
|---|---|---|---|
| CED-Tiny | 5.5 | 36.5 | 48.1 |
| CED-Mini | 9.6 | 38.5 | 49.0 |
| CED-Small | 22 | 41.6 | 49.6 |
| CED-Base | 86 | 44.0 | 50.0 |
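The parameter counts in the table can be sanity-checked directly from the released checkpoints. A minimal sketch, assuming all four variants follow the same `mispeech/ced-*` naming as `ced-tiny` (the repo names beyond `ced-tiny` are an extrapolation):

```python
from transformers import AutoModelForAudioClassification

# Assumed repo names, extrapolated from mispeech/ced-tiny
for name in ("mispeech/ced-tiny", "mispeech/ced-mini", "mispeech/ced-small", "mispeech/ced-base"):
    model = AutoModelForAudioClassification.from_pretrained(name, trust_remote_code=True)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```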
Compared with other available AudioSet taggers, CED focuses on efficiency (see the parameter counts above). The checkpoints can be used directly with Hugging Face `transformers`; since they ship custom model code, `trust_remote_code=True` is required:
```python
>>> import torch
>>> import torchaudio
>>> from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

>>> model_name = "mispeech/ced-tiny"
>>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_name, trust_remote_code=True)
>>> model = AutoModelForAudioClassification.from_pretrained(model_name, trust_remote_code=True)

>>> # CED expects 16 kHz audio
>>> audio, sampling_rate = torchaudio.load("/path-to/JeD5V5aaaoI_931_932.wav")
>>> assert sampling_rate == 16000
>>> inputs = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt")

>>> # Forward pass, then map the highest-scoring class id to its AudioSet label
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_class_id = torch.argmax(logits, dim=-1).item()
>>> model.config.id2label[predicted_class_id]
'Finger snapping'
```
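AudioSet tagging is a multi-label task, so per-class sigmoid scores with a top-k readout are often more informative than a single argmax. A small follow-up sketch on the `logits` from above:

```python
>>> # Multi-label readout: independent sigmoid score per AudioSet class
>>> scores = torch.sigmoid(logits).squeeze(0)
>>> top = torch.topk(scores, k=3)
>>> [(model.config.id2label[i.item()], round(s.item(), 3)) for s, i in zip(top.values, top.indices)]
```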
`example_finetune_esc50.ipynb` demonstrates how to train a linear head on the ESC-50 dataset while keeping the CED encoder frozen.
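A minimal sketch of that recipe (not the notebook's exact code): freeze every backbone parameter and train only a fresh linear layer. Using the frozen model's 527-dimensional AudioSet logits as features is an assumption made here for simplicity; the notebook may instead tap the encoder's pooled embedding.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForAudioClassification

# Frozen CED backbone; only the new linear head below is trained
backbone = AutoModelForAudioClassification.from_pretrained("mispeech/ced-tiny", trust_remote_code=True)
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

head = nn.Linear(backbone.config.num_labels, 50)  # ESC-50 has 50 classes
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(inputs, labels):
    # inputs: dict from the feature extractor; labels: LongTensor of ESC-50 class ids
    with torch.no_grad():  # no gradients flow through the frozen backbone
        feats = backbone(**inputs).logits
    loss = criterion(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```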