Dolma 2 tokenizer, Instruct v2, reasoner version

A slightly modified version of cl100k_base that adds support for Dolma 1.x and Dolma 2.x special tokens.

Special tokens

This tokenizer supports the following special tokens:

  • <|extra_id_0|>: Not used.
  • <|endoftext|>: Marks both the beginning and the end of text.
  • <|fim_prefix|>: Marks the prefix segment of a fill-in-the-middle request.
  • <|fim_middle|>: Marks the middle segment of a fill-in-the-middle request.
  • <|fim_suffix|>: Marks the suffix segment of a fill-in-the-middle request.
  • |||PHONE_NUMBER|||: Not used. Kept for compatibility with Dolma 1.x.
  • |||EMAIL_ADDRESS|||: Not used. Kept for compatibility with Dolma 1.x.
  • |||IP_ADDRESS|||: Not used. Kept for compatibility with Dolma 1.x.
  • <|im_start|>: Indicates the beginning of a message (turn in a conversation).
  • <|im_end|>: Indicates the end of a message (turn in a conversation).
  • <|extra_id_1|>: Not used.
  • <|extra_id_2|>: Not used.
  • <|extra_id_3|>: Not used.
  • <|extra_id_4|>: Not used.
  • <|extra_id_5|>: Not used.
  • <|extra_id_6|>: Not used.
  • <|extra_id_7|>: Not used.
  • <|extra_id_8|>: Not used.
  • <|extra_id_9|>: Not used.
  • <|extra_id_10|>: Not used.
  • <|endofprompt|>: Not used.
  • <|pad|>: Padding token for input sequences.
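
Because these strings are reserved, it can be useful to check untrusted text for them before placing it in a prompt. The following is a minimal sketch of such a check, not part of the tokenizer itself: the helper name find_special_tokens is hypothetical, and the token strings are simply copied from the list above.

```python
import re

# Special-token strings copied from the list above (22 in total).
SPECIAL_TOKENS = [
    "<|endoftext|>",
    "<|fim_prefix|>",
    "<|fim_middle|>",
    "<|fim_suffix|>",
    "|||PHONE_NUMBER|||",
    "|||EMAIL_ADDRESS|||",
    "|||IP_ADDRESS|||",
    "<|im_start|>",
    "<|im_end|>",
    "<|endofprompt|>",
    "<|pad|>",
] + [f"<|extra_id_{i}|>" for i in range(11)]  # <|extra_id_0|> .. <|extra_id_10|>

# Escape each token so characters such as "|" are matched literally.
_SPECIAL_RE = re.compile("|".join(re.escape(tok) for tok in SPECIAL_TOKENS))


def find_special_tokens(text):
    """Return the special-token strings found in text, in order of appearance.

    Hypothetical helper for illustration; it inspects raw strings only and
    does not call the tokenizer.
    """
    return _SPECIAL_RE.findall(text)
```

A caller could reject or escape any input for which find_special_tokens returns a non-empty list; which policy is appropriate depends on the application.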

Chat template

The chat template is shown below for reference only; the authoritative template is in tokenizer_config.json:

{% set has_system = messages|selectattr('role', 'equalto', 'system')|list|length > 0 %}
{% if not has_system %}
{{ '<|im_start|>system
You are Olmo, a helpful AI assistant built by Ai2. Your date cutoff is December 2024, and your model weights are available at https://huggingface.co/allenai.<|im_end|>
' }}
{% endif %}
{% for message in messages %}
    {% if message['role'] == 'system' %}
{{ '<|im_start|>system
' + message['content'] }}
        {% if message.get('functions', none) is not none %}
{{ ' <functions>' + message['functions'] + '</functions><|im_end|>
' }}
        {% else %}
{{ ' You do not currently have access to any functions. <functions></functions><|im_end|>
' }}
        {% endif %}
    {% elif message['role'] == 'user' %}
        {% if message.get('functions', none) is not none %}
{{ '<|im_start|>user
' + message['content'] + '
' + '<functions>' + message['functions'] + '</functions><|im_end|>
' }}
        {% else %}
{{ '<|im_start|>user
' + message['content'] + '<|im_end|>
' }}
        {% endif %}
    {% elif message['role'] == 'assistant' %}
{{ '<|im_start|>assistant
' }}
        {% if message.get('content', none) is not none %}
{{ message['content'] }}
        {% endif %}
        {% if message.get('function_calls', none) is not none %}
{{ '<function_calls>' + message['function_calls'] + '</function_calls>' }}
        {% endif %}
        {% if not loop.last %}
{{ '<|im_end|>' + '
' }}
        {% else %}
{{ eos_token }}
        {% endif %}
    {% elif message['role'] == 'environment' %}
{{ '<|im_start|>environment
' + message['content'] + '<|im_end|>
' }}
    {% endif %}
    {% if loop.last and add_generation_prompt %}
{{ '<|im_start|>assistant
<think>' }}
    {% endif %}
{% endfor %}
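
To make the resulting prompt format easier to see, here is a plain-Python sketch that mirrors the branches of the template above. It is an approximation for illustration, not the tokenizer's implementation: the function name render_chat is hypothetical, the default eos_token value is an assumption (the real value comes from the tokenizer configuration), and the sketch assumes that the line breaks and indentation between template tags shown above are not emitted.

```python
DEFAULT_SYSTEM = (
    "You are Olmo, a helpful AI assistant built by Ai2. Your date cutoff is "
    "December 2024, and your model weights are available at "
    "https://huggingface.co/allenai."
)


def render_chat(messages, add_generation_prompt=False, eos_token="<|endoftext|>"):
    """Approximate the chat template in plain Python (hypothetical helper).

    eos_token defaults to an assumed value; check tokenizer_config.json.
    """
    parts = []
    # A default system message is injected when none is supplied.
    if not any(m["role"] == "system" for m in messages):
        parts.append("<|im_start|>system\n" + DEFAULT_SYSTEM + "<|im_end|>\n")

    for i, m in enumerate(messages):
        is_last = i == len(messages) - 1
        role = m["role"]
        if role == "system":
            parts.append("<|im_start|>system\n" + m["content"])
            if m.get("functions") is not None:
                parts.append(" <functions>" + m["functions"] + "</functions><|im_end|>\n")
            else:
                parts.append(
                    " You do not currently have access to any functions."
                    " <functions></functions><|im_end|>\n"
                )
        elif role == "user":
            parts.append("<|im_start|>user\n" + m["content"])
            if m.get("functions") is not None:
                parts.append("\n<functions>" + m["functions"] + "</functions>")
            parts.append("<|im_end|>\n")
        elif role == "assistant":
            parts.append("<|im_start|>assistant\n")
            if m.get("content") is not None:
                parts.append(m["content"])
            if m.get("function_calls") is not None:
                parts.append("<function_calls>" + m["function_calls"] + "</function_calls>")
            # A final assistant turn is closed with the EOS token instead of <|im_end|>.
            parts.append(eos_token if is_last else "<|im_end|>\n")
        elif role == "environment":
            parts.append("<|im_start|>environment\n" + m["content"] + "<|im_end|>\n")
        # The generation prompt opens an assistant turn that starts with <think>.
        if is_last and add_generation_prompt:
            parts.append("<|im_start|>assistant\n<think>")
    return "".join(parts)


if __name__ == "__main__":
    print(render_chat([{"role": "user", "content": "Hello!"}], add_generation_prompt=True))
```

For real use, the tokenizer's own template should be applied rather than this sketch, since only the template in tokenizer_config.json is authoritative.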