Daisy Tokenizer

Custom byte-level BPE tokenizer trained for the Daisy language model, optimized for Python code and instruction-following tasks.

Details

Property         Value
Vocabulary size  49,152
Algorithm        Byte-level BPE
Pre-tokenizer    Llama-3-style regex
Chat format      ChatML
Max length       131,072 tokens
Training date    2026-01-14

Features

  • Python-optimized: Trained on Python code for efficient tokenization
  • Tool calling: Native support for <|tool_call|> / <|tool_result|> patterns
  • Inline computation: Support for <|python|> / <|output|> for calculator-style reasoning
  • Chain-of-thought: <|think|> tokens for reasoning blocks
  • No UNK tokens: Byte-level fallback handles any Unicode input (see the sketch below)
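
A minimal sketch of the byte-level fallback, assuming the standard transformers API and the repo id shown in the Usage section below:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")

# Byte-level BPE: any Unicode input decomposes into known byte tokens,
# so nothing ever maps to an unknown-token placeholder.
text = "naïve café → 東京 🌼"
ids = tokenizer.encode(text, add_special_tokens=False)
print(tokenizer.decode(ids))  # should round-trip the original string
print(tokenizer.unk_token)    # expected: None, since no UNK token is defined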

Special Tokens

Token             ID     Purpose
<|endoftext|>     49131  End of sequence / BOS
<|pad|>           49132  Padding token
<|im_start|>      49133  Start of message (ChatML)
<|im_end|>        49134  End of message (ChatML)
<|tool_call|>     49135  Start of tool call
<|/tool_call|>    49136  End of tool call
<|tool_result|>   49137  Start of tool result
<|/tool_result|>  49138  End of tool result
<|python|>        49139  Start of Python expression
<|/python|>       49140  End of Python expression
<|output|>        49141  Start of computed output
<|/output|>       49142  End of computed output
<|think|>         49143  Start of reasoning block
<|/think|>        49144  End of reasoning block
<|system|>        49145  System role marker
<|user|>          49146  User role marker
<|assistant|>     49147  Assistant role marker
<|reserved_0|>    49148  Reserved
<|reserved_1|>    49149  Reserved
<|reserved_2|>    49150  Reserved
<|reserved_3|>    49151  Reserved
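
Continuing from the sketch above, the documented IDs can be spot-checked with convert_tokens_to_ids; whether <|endoftext|> and <|pad|> are also wired up as the eos/pad tokens in the shipped tokenizer config is an assumption:

# Spot-check the special-token IDs listed above.
for tok in ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|tool_call|>", "<|think|>"]:
    print(tok, tokenizer.convert_tokens_to_ids(tok))  # expected: 49131, 49133, 49134, 49135, 49143

# Assumed config mapping: eos -> <|endoftext|>, pad -> <|pad|>.
print(tokenizer.eos_token, tokenizer.eos_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)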

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")

# Basic encoding
tokens = tokenizer.encode("Hello, world!")

# Chat formatting
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help you?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
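
For inference, a generation prompt can be appended so the model continues as the assistant. This is a sketch assuming the chat template bundled with the repo supports add_generation_prompt, as ChatML-style templates typically do:

# Render the conversation and append the assistant header for generation.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about Python."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt")  # ready to pass to the model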

Chat Template Format

<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>

Tool Calling Example

<|im_start|>assistant
Let me calculate that for you.
<|tool_call|>{"name": "calculator", "arguments": {"expression": "2 + 2"}}<|/tool_call|>
<|tool_result|>4<|/tool_result|>
The answer is 4.<|im_end|>
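
Whether the bundled chat template renders tool messages automatically is not documented here, so the sketch below (reusing the tokenizer loaded in the Usage section) assembles the turn above directly from the special tokens:

import json

call = {"name": "calculator", "arguments": {"expression": "2 + 2"}}
turn = (
    "<|im_start|>assistant\n"
    "Let me calculate that for you.\n"
    f"<|tool_call|>{json.dumps(call)}<|/tool_call|>\n"
    "<|tool_result|>4<|/tool_result|>\n"
    "The answer is 4.<|im_end|>"
)
# The markers are added tokens, so each should encode to its single reserved ID.
ids = tokenizer.encode(turn, add_special_tokens=False)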

Compression Ratios

Benchmarked against common tokenizers on Python code, prose, and instruction data:

Python Code (SmolTalk self-oss-instruct, 504 samples)

Tokenizer                            Vocab Size  Chars/Token  Tokens
meta-llama/Llama-3.2-3B-Instruct     128,000     4.391        88,644
Qwen/Qwen2.5-1.5B-Instruct           151,643     4.366        89,139
HuggingFaceTB/SmolLM2-135M-Instruct  49,152      3.906        99,650
JonathanMiddleton/daisy              49,131      3.766        103,349
microsoft/phi-2                      50,257      3.628        107,290
openai-community/gpt2                50,257      3.152        123,467

English Prose (FineWeb-Edu, 505 samples)

Tokenizer                            Vocab Size  Chars/Token  Tokens
meta-llama/Llama-3.2-3B-Instruct     128,000     4.681        466,617
JonathanMiddleton/daisy              49,131      4.594        475,422
openai-community/gpt2                50,257      4.584        476,460
microsoft/phi-2                      50,257      4.584        476,461
Qwen/Qwen2.5-1.5B-Instruct           151,643     4.563        478,607
HuggingFaceTB/SmolLM2-135M-Instruct  49,152      4.475        488,120

Instructions (SmolTalk, 504 samples)

Tokenizer                            Vocab Size  Chars/Token  Tokens
meta-llama/Llama-3.2-3B-Instruct     128,000     4.771        737,130
Qwen/Qwen2.5-1.5B-Instruct           151,643     4.731        743,360
JonathanMiddleton/daisy              49,131      4.487        783,803
HuggingFaceTB/SmolLM2-135M-Instruct  49,152      4.455        789,399
microsoft/phi-2                      50,257      4.437        792,658
openai-community/gpt2                50,257      4.254        826,711

Cross-Content Average

Tokenizer                            Python  Prose  Instruction  Average
meta-llama/Llama-3.2-3B-Instruct     4.391   4.681  4.771        4.614
Qwen/Qwen2.5-1.5B-Instruct           4.366   4.563  4.731        4.554
JonathanMiddleton/daisy              3.766   4.594  4.487        4.282
HuggingFaceTB/SmolLM2-135M-Instruct  3.906   4.475  4.455        4.278
microsoft/phi-2                      3.628   4.584  4.437        4.216
openai-community/gpt2                3.152   4.584  4.254        3.997

Key findings: With a ~49K vocabulary, Daisy achieves competitive compression: among the similar-sized (~50K-vocab) tokenizers tested it has the best chars/token on prose, instructions, and the cross-content average, ranks second overall on prose (behind Llama-3.2's 128K vocabulary), and remains competitive on Python code.
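
The exact benchmark harness is not included here; the chars/token figures correspond to a simple ratio of total characters to total tokens over each sample set, along the lines of:

def chars_per_token(tokenizer, texts):
    # Compression ratio as reported above: total characters / total tokens.
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
    return total_chars / total_tokens

# e.g. chars_per_token(tokenizer, python_samples) over the 504 SmolTalk
# self-oss-instruct samples; whether the original run counted special tokens
# is not stated, so treat this as an approximation.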

Training Data

  • General text: lehduong/nemotron-cc-hq (~60%)
  • Python code: HuggingFaceTB/smoltalk (self-oss-instruct subset) (~25%)
  • Instructions: HuggingFaceTB/OpenHermes-2.5-H4, OpenHermes (~15%)

License

Apache 2.0
