# Daisy Tokenizer
Custom byte-level BPE tokenizer trained for the Daisy language model, optimized for Python code and instruction-following tasks.
## Details

| Property | Value |
|---|---|
| Vocabulary size | 49,152 |
| Algorithm | Byte-level BPE |
| Pre-tokenizer | Llama-3 style regex |
| Chat format | ChatML |
| Max length | 131,072 tokens |
| Training date | 2026-01-14 |
## Features

- Python-optimized: Trained on Python code for efficient tokenization
- Tool calling: Native support for `<|tool_call|>` / `<|tool_result|>` patterns
- Inline computation: Support for `<|python|>` / `<|output|>` for calculator-style reasoning
- Chain-of-thought: `<|think|>` tokens for reasoning blocks
- No UNK tokens: Byte-level fallback handles any Unicode input (see the sketch below)
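A quick sketch of the byte-level fallback, assuming the tokenizer is loaded from the same repo id as in the Usage section below: arbitrary Unicode should round-trip without any unknown-token placeholder.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")

# Byte-level BPE falls back to raw bytes, so encode/decode should be lossless.
text = "naïve café 日本語 🙂"
ids = tokenizer.encode(text, add_special_tokens=False)
print(tokenizer.decode(ids) == text)  # expected: True
```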
## Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<\|endoftext\|>` | 49131 | End of sequence / BOS |
| `<\|pad\|>` | 49132 | Padding token |
| `<\|im_start\|>` | 49133 | Start of message (ChatML) |
| `<\|im_end\|>` | 49134 | End of message (ChatML) |
| `<\|tool_call\|>` | 49135 | Start of tool call |
| `<\|/tool_call\|>` | 49136 | End of tool call |
| `<\|tool_result\|>` | 49137 | Start of tool result |
| `<\|/tool_result\|>` | 49138 | End of tool result |
| `<\|python\|>` | 49139 | Start of Python expression |
| `<\|/python\|>` | 49140 | End of Python expression |
| `<\|output\|>` | 49141 | Start of computed output |
| `<\|/output\|>` | 49142 | End of computed output |
| `<\|think\|>` | 49143 | Start of reasoning block |
| `<\|/think\|>` | 49144 | End of reasoning block |
| `<\|system\|>` | 49145 | System role marker |
| `<\|user\|>` | 49146 | User role marker |
| `<\|assistant\|>` | 49147 | Assistant role marker |
| `<\|reserved_0\|>` | 49148 | Reserved |
| `<\|reserved_1\|>` | 49149 | Reserved |
| `<\|reserved_2\|>` | 49150 | Reserved |
| `<\|reserved_3\|>` | 49151 | Reserved |
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")

# Encode plain text
tokens = tokenizer.encode("Hello, world!")

# Format a conversation with the ChatML chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help you?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
```
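For inference prompts, `apply_chat_template` can also append an open assistant turn. Continuing from the snippet above; the expected ending is an assumption based on the ChatML template shown in the next section:

```python
# Continuing from the Usage snippet above.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(repr(prompt[-25:]))  # expected to end with "<|im_start|>assistant\n"
```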
## Chat Template Format

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>
```
## Tool Calling Example

```
<|im_start|>assistant
Let me calculate that for you.
<|tool_call|>{"name": "calculator", "arguments": {"expression": "2 + 2"}}<|/tool_call|>
<|tool_result|>4<|/tool_result|>
The answer is 4.<|im_end|>
```
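When post-processing model output, tool calls can be pulled out of the generated text with a simple pattern match. The snippet below is a hypothetical helper, not part of the tokenizer package; it only assumes the marker format shown in the example above.

```python
import json
import re

# Matches JSON payloads between the tool-call markers shown in the example above.
TOOL_CALL_RE = re.compile(r"<\|tool_call\|>(.*?)<\|/tool_call\|>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Return the parsed payload of every tool call in a generated string."""
    return [json.loads(payload) for payload in TOOL_CALL_RE.findall(text)]

generation = (
    "Let me calculate that for you.\n"
    '<|tool_call|>{"name": "calculator", "arguments": {"expression": "2 + 2"}}<|/tool_call|>'
)
print(extract_tool_calls(generation))
# [{'name': 'calculator', 'arguments': {'expression': '2 + 2'}}]
```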
## Compression Ratios

Benchmarked against common tokenizers on Python code, prose, and instruction data:

### Python Code (SmolTalk self-oss-instruct, 504 samples)

| Tokenizer | Vocab Size | Chars/Token | Tokens |
|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 128,000 | 4.391 | 88,644 |
| Qwen/Qwen2.5-1.5B-Instruct | 151,643 | 4.366 | 89,139 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152 | 3.906 | 99,650 |
| JonathanMiddleton/daisy | 49,131 | 3.766 | 103,349 |
| microsoft/phi-2 | 50,257 | 3.628 | 107,290 |
| openai-community/gpt2 | 50,257 | 3.152 | 123,467 |
### English Prose (FineWeb-Edu, 505 samples)

| Tokenizer | Vocab Size | Chars/Token | Tokens |
|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 128,000 | 4.681 | 466,617 |
| JonathanMiddleton/daisy | 49,131 | 4.594 | 475,422 |
| openai-community/gpt2 | 50,257 | 4.584 | 476,460 |
| microsoft/phi-2 | 50,257 | 4.584 | 476,461 |
| Qwen/Qwen2.5-1.5B-Instruct | 151,643 | 4.563 | 478,607 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152 | 4.475 | 488,120 |
### Instructions (SmolTalk, 504 samples)

| Tokenizer | Vocab Size | Chars/Token | Tokens |
|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 128,000 | 4.771 | 737,130 |
| Qwen/Qwen2.5-1.5B-Instruct | 151,643 | 4.731 | 743,360 |
| JonathanMiddleton/daisy | 49,131 | 4.487 | 783,803 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152 | 4.455 | 789,399 |
| microsoft/phi-2 | 50,257 | 4.437 | 792,658 |
| openai-community/gpt2 | 50,257 | 4.254 | 826,711 |
### Cross-Content Average (chars/token)

| Tokenizer | Python | Prose | Instruction | Average |
|---|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 4.391 | 4.681 | 4.771 | 4.614 |
| Qwen/Qwen2.5-1.5B-Instruct | 4.366 | 4.563 | 4.731 | 4.554 |
| JonathanMiddleton/daisy | 3.766 | 4.594 | 4.487 | 4.282 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 3.906 | 4.475 | 4.455 | 4.278 |
| microsoft/phi-2 | 3.628 | 4.584 | 4.437 | 4.216 |
| openai-community/gpt2 | 3.152 | 4.584 | 4.254 | 3.997 |
Key findings: With a ~49K vocabulary, Daisy ranks first among the similar-sized (~50K-vocab) tokenizers on prose, instructions, and the cross-content average, places second overall on prose, and is second only to SmolLM2 among the smaller tokenizers on Python code.
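The chars/token figures can in principle be reproduced as a simple ratio of characters to tokens over a corpus sample. The snippet below is only a sketch; the placeholder sample and the `add_special_tokens=False` choice are assumptions, not the exact benchmark setup.

```python
from transformers import AutoTokenizer

def chars_per_token(tokenizer, texts):
    # Total characters divided by total tokens over a list of documents.
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(
        len(tokenizer.encode(t, add_special_tokens=False)) for t in texts
    )
    return total_chars / total_tokens

tok = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")
sample = ["def add(a, b):\n    return a + b\n"]  # placeholder, not the benchmark corpus
print(f"{chars_per_token(tok, sample):.3f} chars/token")
```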
## Training Data

- General text: `lehduong/nemotron-cc-hq` (~60%)
- Python code: `HuggingFaceTB/smoltalk`, self-oss-instruct (~25%)
- Instructions: `HuggingFaceTB/OpenHermes-2.5-H4`, OpenHermes (~15%)
## License

Apache 2.0