Alignment-Lab-AI, herimor committed
Commit 1e093a4 · verified · 0 parent(s)

Duplicate from herimor/voxtream

Co-authored-by: Nikita Torgashov <herimor@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
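To see which files in the repository these 35 rules route through Git LFS, the patterns can be approximated in Python. This is a minimal sketch, not Git's own matcher: it assumes that patterns without a slash apply to the file's basename and patterns with a slash apply to the whole path, and it uses `fnmatch`, whose `*` also crosses `/`. The helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Extensions copied from the .gitattributes rules in this commit.
EXTENSIONS = (
    "7z arrow bin bz2 ckpt ftz gz h5 joblib mlmodel model msgpack npy npz "
    "onnx ot parquet pb pickle pkl pt pth rar safetensors tar tflite tgz "
    "wasm xz zip zst"
).split()

# The four rules that are not plain "*.ext" patterns.
SPECIAL = ["*.lfs.*", "saved_model/**/*", "*.tar.*", "*tfevents*"]

PATTERNS = [f"*.{ext}" for ext in EXTENSIONS] + SPECIAL


def is_lfs_tracked(path: str) -> bool:
    """Approximate check: does any LFS rule above match this repo path?"""
    posix = PurePosixPath(path)
    for pattern in PATTERNS:
        # Assumption: slash-free patterns are tested against the basename,
        # patterns containing a slash against the full relative path.
        target = str(posix) if "/" in pattern else posix.name
        if fnmatchcase(target, pattern):
            return True
    return False


for name in ["model.safetensors", "config.json", "README.md"]:
    print(name, is_lfs_tracked(name))
```

In this commit, that would mean the two `.safetensors` files are stored as LFS pointers while `README.md`, `config.json`, and `phoneme_to_token.json` are stored as regular text, which agrees with the diffs that follow.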
README.md ADDED
@@ -0,0 +1,74 @@
+ ---
+ language:
+ - en
+ license: cc-by-4.0
+ pipeline_tag: text-to-speech
+ tags:
+ - voxtream
+ - text-to-speech
+ ---
+
+ # Model Card for VoXtream
+
+ VoXtream is a fully autoregressive, zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word.
+
+ ### Key features
+
+ - **Streaming**: Supports a full-stream scenario, where the full sentence is not known in advance. The model takes a text stream arriving word by word as input and outputs an audio stream in 80 ms chunks.
+ - **Speed**: Runs **5x** faster than real time and achieves **102 ms** first-packet latency on GPU.
+ - **Quality and efficiency**: With only 9k hours of training data, it matches or surpasses the quality and intelligibility of larger models or models trained on large datasets.
+
+ ### Model Sources
+
+ - **Repository:** [repo](https://github.com/herimor/voxtream)
+ - **Paper:** [paper](https://arxiv.org/pdf/2509.15969)
+ - **Demo:** [demo](https://herimor.github.io/voxtream)
+
+ ## Get started
+
+ ### Installation
+
+ ```bash
+ pip install voxtream
+ ```
+
+ ### Usage
+
+ #### Output streaming
+ ```bash
+ voxtream \
+ --prompt-audio assets/audio/male.wav \
+ --prompt-text "The liquor was first created as 'Brandy Milk', produced with milk, brandy and vanilla." \
+ --text "In general, however, some method is then needed to evaluate each approximation." \
+ --output "output_stream.wav"
+ ```
+ * Note: The initial run may take some time because it downloads the model weights.
+
+ #### Full streaming
+ ```bash
+ voxtream \
+ --prompt-audio assets/audio/female.wav \
+ --prompt-text "Betty Cooper helps Archie with cleaning a store room, when Reggie attacks her." \
+ --text "Staff do not always do enough to prevent violence." \
+ --output "full_stream.wav" \
+ --full-stream
+ ```
+
+ ### Out-of-Scope Use
+
+ Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.
+
+ ## Training Data
+
+ The model was trained on a 9k-hour subset of the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) and [HiFiTTS2](https://huggingface.co/datasets/nvidia/hifitts-2) datasets. You can download it [here](https://huggingface.co/datasets/herimor/voxtream-train-9k). For more details, please check our paper.
+
+ ## Citation
+
+ ```
+ @article{torgashov2025voxtream,
+ author = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
+ title = {Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
+ journal = {arXiv:2509.15969},
+ year = {2025}
+ }
+ ```
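The two CLI invocations in the README differ only in the `--full-stream` flag. A minimal Python sketch of assembling either command line is shown here; it uses only the flags that appear in the README, and the helper name `build_voxtream_command` is hypothetical. Nothing is assumed about a Python API for the `voxtream` package, since this commit does not show one.

```python
import shlex


def build_voxtream_command(prompt_audio, prompt_text, text, output,
                           full_stream=False):
    """Return the argv list for the `voxtream` CLI as used in the README."""
    argv = [
        "voxtream",
        "--prompt-audio", prompt_audio,
        "--prompt-text", prompt_text,
        "--text", text,
        "--output", output,
    ]
    if full_stream:
        # Full streaming: text is consumed word by word, per the README.
        argv.append("--full-stream")
    return argv


argv = build_voxtream_command(
    prompt_audio="assets/audio/female.wav",
    prompt_text="Betty Cooper helps Archie with cleaning a store room, "
                "when Reggie attacks her.",
    text="Staff do not always do enough to prevent violence.",
    output="full_stream.wav",
    full_stream=True,
)

# Print a shell-quoted command line; to run it, pass `argv` to
# subprocess.run(argv, check=True) in an environment with voxtream installed.
print(shlex.join(argv))
```

Passing the list form to `subprocess.run` avoids shell quoting problems with the apostrophes and commas in the prompt text.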
config.json ADDED
@@ -0,0 +1,15 @@
+ {
+ "phone_former": "phone_former",
+ "temp_former": "temp_former",
+ "dep_former": "dep_former_csm",
+ "phone_vocab_size": 73,
+ "audio_vocab_size": 2049,
+ "embedding_dim": 1024,
+ "spk_embedding_dim": 192,
+ "num_codebooks": 12,
+ "num_phone_states": 4,
+ "amortization_divisor": 16,
+ "look_ahead": 2,
+ "audio_window_size": 250,
+ "phone_window_size": 350
+ }
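As a rough illustration of how a consumer might read this file, the sketch below parses the same JSON into a frozen dataclass and rejects missing or unexpected keys. The class name `VoxtreamConfig` and the function `load_config` are hypothetical and are not taken from the `voxtream` package; the commit does not document what each field means, so the code only checks names and does not interpret values.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class VoxtreamConfig:
    # Field names and types mirror config.json in this commit.
    phone_former: str
    temp_former: str
    dep_former: str
    phone_vocab_size: int
    audio_vocab_size: int
    embedding_dim: int
    spk_embedding_dim: int
    num_codebooks: int
    num_phone_states: int
    amortization_divisor: int
    look_ahead: int
    audio_window_size: int
    phone_window_size: int


def load_config(text: str) -> VoxtreamConfig:
    """Parse config JSON, failing loudly on missing or unknown keys."""
    raw = json.loads(text)
    expected = {f.name for f in fields(VoxtreamConfig)}
    missing, unknown = expected - set(raw), set(raw) - expected
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    return VoxtreamConfig(**raw)


CONFIG_JSON = """
{
  "phone_former": "phone_former",
  "temp_former": "temp_former",
  "dep_former": "dep_former_csm",
  "phone_vocab_size": 73,
  "audio_vocab_size": 2049,
  "embedding_dim": 1024,
  "spk_embedding_dim": 192,
  "num_codebooks": 12,
  "num_phone_states": 4,
  "amortization_divisor": 16,
  "look_ahead": 2,
  "audio_window_size": 250,
  "phone_window_size": 350
}
"""

config = load_config(CONFIG_JSON)
print(config.dep_former, config.num_codebooks)
```

Note that `dep_former` is `"dep_former_csm"`, the same stem as the `dep_former_csm.safetensors` file added in this commit.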
dep_former_csm.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f6a8b06be6e4a5aee244b6218a5ce7bd28c8b288a2c5c994af021d2579e6a2fc
+ size 637669544
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:73b7039e40434ebe7a7f0faeb91406cd54fc358185e5f836fdd10e36aef377f9
+ size 1767213632
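Both `.safetensors` entries are Git LFS pointer files: three `key value` lines giving the spec version, a SHA-256 object ID, and the size in bytes. The sketch below parses such a pointer and checks a downloaded file against it. The function names are hypothetical, and the code handles only the three keys shown in this diff.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer into version, hash algorithm, digest, size."""
    pairs = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = pairs["oid"].split(":", 1)
    return {
        "version": pairs["version"],
        "algo": algo,
        "digest": digest,
        "size": int(pairs["size"]),
    }


def matches_pointer(path, pointer: dict) -> bool:
    """Check a local file's size and hash against a parsed pointer."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so multi-gigabyte weights fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


MODEL_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:73b7039e40434ebe7a7f0faeb91406cd54fc358185e5f836fdd10e36aef377f9
size 1767213632
"""

pointer = parse_lfs_pointer(MODEL_POINTER)
print(f"model.safetensors: {pointer['size'] / 2**30:.2f} GiB, {pointer['algo']}")
```

After downloading the weights, `matches_pointer("model.safetensors", pointer)` would confirm that the local copy is the object this commit refers to.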
phoneme_to_token.json ADDED
@@ -0,0 +1,73 @@
+ {
+ "AA0": 0,
+ "AA1": 1,
+ "AA2": 2,
+ "AE0": 3,
+ "AE1": 4,
+ "AE2": 5,
+ "AH0": 6,
+ "AH1": 7,
+ "AH2": 8,
+ "AO0": 9,
+ "AO1": 10,
+ "AO2": 11,
+ "AW0": 12,
+ "AW1": 13,
+ "AW2": 14,
+ "AY0": 15,
+ "AY1": 16,
+ "AY2": 17,
+ "B": 18,
+ "CH": 19,
+ "D": 20,
+ "DH": 21,
+ "EH0": 22,
+ "EH1": 23,
+ "EH2": 24,
+ "ER0": 25,
+ "ER1": 26,
+ "ER2": 27,
+ "EY0": 28,
+ "EY1": 29,
+ "EY2": 30,
+ "F": 31,
+ "G": 32,
+ "HH": 33,
+ "IH0": 34,
+ "IH1": 35,
+ "IH2": 36,
+ "IY0": 37,
+ "IY1": 38,
+ "IY2": 39,
+ "JH": 40,
+ "K": 41,
+ "L": 42,
+ "M": 43,
+ "N": 44,
+ "NG": 45,
+ "OW0": 46,
+ "OW1": 47,
+ "OW2": 48,
+ "OY0": 49,
+ "OY1": 50,
+ "OY2": 51,
+ "P": 52,
+ "R": 53,
+ "S": 54,
+ "SH": 55,
+ "T": 56,
+ "TH": 57,
+ "UH0": 58,
+ "UH1": 59,
+ "UH2": 60,
+ "UW0": 61,
+ "UW1": 62,
+ "UW2": 63,
+ "V": 64,
+ "W": 65,
+ "Y": 66,
+ "Z": 67,
+ "ZH": 68,
+ "sil": 69,
+ "spn": 70
+ }
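The table above follows a regular pattern: ARPAbet vowels with stress markers 0 to 2 and ARPAbet consonants, sorted alphabetically, followed by `sil` and `spn`. The sketch below rebuilds the same 71 entries from that observation and encodes a phoneme sequence. It is an illustration only: the JSON file in the commit is authoritative, the helper names are hypothetical, and how the `voxtream` package consumes the table is not shown here. The table has 71 entries while `config.json` sets `phone_vocab_size` to 73; the commit does not say what the remaining two IDs are used for.

```python
# ARPAbet inventory inferred from the keys of phoneme_to_token.json.
VOWELS = "AA AE AH AO AW AY EH ER EY IH IY OW OY UH UW".split()
CONSONANTS = "B CH D DH F G HH JH K L M N NG P R S SH T TH V W Y Z ZH".split()
SPECIAL = ["sil", "spn"]  # appended after the sorted phonemes in the file


def build_phoneme_table() -> dict:
    """Rebuild the phoneme-to-ID mapping from the observed ordering."""
    stressed = [f"{vowel}{stress}" for vowel in VOWELS for stress in range(3)]
    ordered = sorted(stressed + CONSONANTS) + SPECIAL
    return {phoneme: index for index, phoneme in enumerate(ordered)}


def encode(phonemes, table) -> list:
    """Map phoneme symbols to token IDs; unknown symbols raise KeyError."""
    return [table[p] for p in phonemes]


table = build_phoneme_table()
print(len(table))
print(encode(["HH", "AH0", "L", "OW1"], table))  # one pronunciation of "hello"
```

In practice, loading the shipped file with `json.load` is the safer choice; the reconstruction is mainly useful as a sanity check that a local copy has not been reordered or truncated.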