

SAGE Reasoning 8B

Advanced Hybrid Reasoning Model with Tool-Calling Capabilities



Overview

Models in the SAGE Reasoning family are instruction-tuned, text-in/text-out generative systems released under a license that permits commercial use (see the License section below).

Key Features

Hybrid Reasoning Architecture

  • Dual Mode Operation: Capable of producing fast direct responses in standard LLM mode, or applying self-reflection before answering in reasoning mode
  • Advanced Training: Uses Iterated Distillation and Amplification (IDA) - a scalable alignment method based on iterative self-improvement

Specialized Capabilities

  • Code Generation: Optimized for programming tasks with strong coding abilities
  • STEM Excellence: Enhanced performance on science, technology, engineering, and mathematics problems
  • Instruction Following: Superior adherence to complex instructions and prompts
  • Tool Calling: Notable strength in tool-calling ability compared to similar-sized models

Global Reach

  • Multilingual Support: Over 30 languages supported
  • Extended Context: 128k context window for handling large documents and conversations
  • Consistent Performance: Both standard and reasoning variants consistently outperform other models in the same parameter class on public benchmarks

Evaluations

We compare our models against state-of-the-art, size-equivalent models in both direct mode and reasoning mode. For direct mode, we compare against the Llama/Qwen instruct counterparts. For reasoning mode, we use DeepSeek's R1-distilled counterparts and Qwen's QwQ model.

Overall Performance Benchmarks

Figure: comprehensive benchmark results showing SAGE Reasoning 8B performance across multiple evaluation metrics.

Livebench Global Average

Figure: Livebench global average comparison, showing consistently stronger results than size-equivalent models.

Tool Calling Performance

Figure: tool-calling benchmarks, showing enhanced performance in function calling and tool utilization.


Usage

Here is a snippet for usage with Transformers:

import transformers
import torch

model_id = "sagea-ai/sage-reasoning-8b"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Give me a short introduction to LLMs."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)

# The last entry of generated_text is the assistant's reply
print(outputs[0]["generated_text"][-1])

Implementing extended thinking

  • By default, the model answers in standard mode.
  • To enable thinking, use either of the following two methods:
    • Add a specific system prompt, or
    • Set enable_thinking=True while applying the chat template.

NOTE: For the SAGE Reasoning 8B model, we suggest setting repetition_penalty=1.1 when using extended thinking; a sketch of how to pass it is shown after the Method 1 example below.

Method 1 - Add a specific system prompt.

To enable thinking, set the system prompt to system_instruction = 'Enable deep thinking subroutine.'

If you already have a system_instruction, use system_instruction = 'Enable deep thinking subroutine.' + '\n\n' + system_instruction.

Here is an example -

import transformers
import torch

model_id = "sagea-ai/sage-reasoning-8b"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION},
    {"role": "user", "content": "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)

print(outputs[0]["generated_text"][-1])
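
To apply the repetition penalty suggested in the note above, it can be passed as a generation keyword argument (a minimal sketch reusing the pipeline and messages objects from the example above):

outputs = pipeline(
    messages,
    max_new_tokens=512,
    repetition_penalty=1.1,  # suggested setting for extended thinking with the 8B model
)

print(outputs[0]["generated_text"][-1])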

Similarly, if you already have a system prompt, prepend DEEP_THINKING_INSTRUCTION to it like this -

DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

system_prompt = "Reply to each prompt with only the actual code - no explanations."
prompt = "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION + '\n\n' + system_prompt},
    {"role": "user", "content": prompt}
]
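
These messages then go through the same pipeline call as before (a sketch, assuming the pipeline object defined in the earlier snippets):

outputs = pipeline(messages, max_new_tokens=512, repetition_penalty=1.1)
print(outputs[0]["generated_text"][-1])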

Method 2 - Set enable_thinking=True in the tokenizer

If you are using Hugging Face tokenizers, simply add the argument enable_thinking=True when applying the chat template (this option is handled by the chat template).

Here is an example -

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sagea-ai/sage-reasoning-8b"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to LLMs."
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Tool Calling

SAGE Reasoning models support tool calling (single, parallel, multiple, and parallel_multiple) in both standard and extended thinking modes; a sketch of the extended thinking variant appears at the end of this section.

Here is a snippet (it reuses the model and tokenizer loaded in the Method 2 example above) -

# First, define a tool
def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.
    
    Args:
        location: The location to get the temperature for, in the format "City, Country"
    Returns:
        The current temperature at the specified location in the specified units, as a float.
    """
    return 22.  # A real function should probably actually get the temperature!

# Next, create a chat and apply the chat template
messages = [
  {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
]

# Render the chat (including the tool schema) to text, then tokenize it
text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
print(output_text)

This will result in the output -

<tool_call>
{"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
</tool_call><|eot_id|>
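
If you want to recover that call programmatically rather than copying it by hand, one option is to pull the JSON out of the <tool_call> tags (a minimal sketch, assuming the tagged output format shown above):

import json
import re

# Extract the JSON payload between the <tool_call> tags
match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", output_text, re.DOTALL)
if match:
    tool_call = json.loads(match.group(1))
    # e.g. {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}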

When the model generates a tool call, you should add it to the chat like so:

tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})

and then call the tool and append the result, with the tool role, like so:

messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})

After that, you can generate() again to let the model use the tool result in the chat:

text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]

This should result in the string -

'The current temperature in Paris is 22.0 degrees.<|eot_id|>'
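
Since tool calling is also supported in extended thinking mode, the same flow can be combined with Method 2 by passing enable_thinking=True alongside the tools when applying the chat template (a sketch, under the assumption that the chat template accepts both arguments together, as the two examples above suggest):

messages = [
  {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
]

text = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_temperature],
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=True,  # reasoning mode alongside tool use
)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    repetition_penalty=1.1,  # suggested setting for extended thinking (see note above)
)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
print(output_text)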

License

This repository and the model weights are licensed under the Llama 3.2 Community License Agreement (the default license agreement for Llama models).


Contact

Get in Touch with Our Team

For inquiries, collaborations, or support, please reach out to us:

Email: founders@sagea.space


SAGE Reasoning 8B
Advancing the frontier of hybrid reasoning models

Made by SAGEA
