CLSP

CLSP (pronounced /klɪsp/) is a contrastive language-speech pretraining model that integrates global and fine-grained supervision to learn unified speech-text representations across multiple granularities. It performs reliably on global and fine-grained speech-text retrieval, zero-shot paralinguistic classification, and speech style similarity scoring, aligning closely with human judgments.

GitHub Repo: CLSP

Note: The model is pretrained on speech/audio sampled at 16 kHz. When using the model, make sure your input is also sampled at 16 kHz.
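
If your audio is stored at a different sample rate, resample it before inference. Below is a minimal sketch using torchaudio (the file path is a placeholder, and torchaudio itself is an assumed dependency, not something this repository requires):

import torchaudio

# Load an audio file; waveform has shape (channels, num_samples).
waveform, sample_rate = torchaudio.load("example.wav")

# Mix down to mono, since the usage example below feeds a single-channel tensor.
waveform = waveform.mean(dim=0, keepdim=True)

# Resample to the 16 kHz rate the model was pretrained on, if needed.
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)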

Paper: Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training

Authors: Yifan Yang, Bing Han, Hui Wang, Wei Wang, Ziyang Ma, Long Zhou, Zengrui Jin, Guanrou Yang, Tianrui Wang, Xu Tan, Xie Chen

Abstract: Modeling fine-grained speaking styles remains challenging for language-speech representation pre-training, as existing speech-text models are typically trained with coarse captions or task-specific supervision, and scalable fine-grained style annotations are unavailable. We present FCaps, a large-scale dataset with fine-grained free-text style descriptions, encompassing 47k hours of speech and 19M fine-grained captions annotated via a novel end-to-end pipeline that directly grounds detailed captions in audio, thereby avoiding the error propagation caused by LLM-based rewriting in existing cascaded pipelines. Evaluations using LLM-as-a-judge demonstrate that our annotations surpass existing cascaded annotations in terms of correctness, coverage, and naturalness. Building on FCaps, we propose CLSP, a contrastive language-speech pre-trained model that integrates global and fine-grained supervision, enabling unified representations across multiple granularities. Extensive experiments demonstrate that CLSP learns fine-grained and multi-granular speech-text representations that perform reliably across global and fine-grained speech-text retrieval, zero-shot paralinguistic classification, and speech style similarity scoring, with strong alignment to human judgments. All resources will be made publicly available.

Usage

import torch
from transformers import AutoModel


model = AutoModel.from_pretrained(
    "yfyeung/CLSP",
    trust_remote_code=True,  # the repository ships custom modeling code
)
if torch.cuda.is_available():
    model = model.to("cuda")

device = next(model.parameters()).device
audio = torch.randn(1, 160000).to(device)  # dummy 10-second clip at 16 kHz (batch of 1)
audio_lens = torch.tensor([160000]).to(device)  # length of each clip in samples
text = [
    "A female speaker with a medium-pitched British accent.",
    "A male speaker with a medium-pitched British accent.",
    "A female speaker delivers her enunciated words rapidly in a medium-pitched British accent, conveying an authoritative tone.",
    "A female speaker delivers her enunciated words slowly in a medium-pitched Chinese accent, conveying an authoritative tone.",
    "A mature female with a clear, medium-pitched voice and a British accent speaks in a formal, presentational style, characteristic of a newsreader or broadcaster. She delivers her speech at a fast pace with deliberate enunciation and a measured, authoritative rhythm. Her tone remains neutral and informative, with subtle emphasis on specific phrases, and her volume is consistently loud and steady. The delivery is fluent and controlled."
]

with torch.no_grad():
    # Forward pass: one embedding per audio clip, one per caption, plus the learned logit scale.
    audio_embedding, text_embedding, logit_scale = model(audio, audio_lens, text)

print(audio_embedding)
print(text_embedding)
print(logit_scale)
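
The returned embeddings can be compared directly for retrieval or zero-shot classification. Below is a minimal sketch that scores each caption against the audio; it assumes CLIP-style outputs (an exponentiated logit scale and roughly unit-norm embeddings) and normalizes explicitly to be safe:

import torch.nn.functional as F

# Normalize defensively in case the returned embeddings are not unit-length.
audio_emb = F.normalize(audio_embedding, dim=-1)
text_emb = F.normalize(text_embedding, dim=-1)

# Temperature-scaled cosine similarities, softmaxed over the candidate captions.
logits = logit_scale * audio_emb @ text_emb.t()
probs = logits.softmax(dim=-1)

for caption, p in zip(text, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")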

Citation

Please cite our paper if you find this work useful:

@misc{yang2026clsp,
    title={Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training}, 
    author={Yifan Yang and Bing Han and Hui Wang and Wei Wang and Ziyang Ma and Long Zhou and Zengrui Jin and Guanrou Yang and Tianrui Wang and Xu Tan and Xie Chen},
    year={2026},
    eprint={2601.03065},
    archivePrefix={arXiv},
    primaryClass={eess.AS},
    url={https://arxiv.org/abs/2601.03065}, 
}