Overview
SAGE Reasoning family models are instruction-tuned, text-in/text-out generative models released under an open license that permits commercial use (see the License section below).
Key Features
Hybrid Reasoning Architecture
- Dual Mode Operation: produces fast, direct responses in standard LLM mode, or applies self-reflection before answering in reasoning mode
- Advanced Training: uses Iterated Distillation and Amplification (IDA), a scalable alignment method based on iterative self-improvement
Specialized Capabilities
- Code Generation: strong coding abilities, optimized for programming tasks
- STEM Excellence: Enhanced performance on science, technology, engineering, and mathematics problems
- Instruction Following: Superior adherence to complex instructions and prompts
- Tool Calling: Notable strength in tool-calling ability compared to similar-sized models
Global Reach
- Multilingual Support: Over 30 languages supported
- Extended Context: 128k context window for handling large documents and conversations
- Consistent Performance: Both standard and reasoning variants consistently outperform other models in the same parameter class on public benchmarks
Evaluations
We compare our models against state-of-the-art size-equivalent models in both direct mode and reasoning mode. For direct mode, we compare against the Llama/Qwen instruct counterparts. For reasoning mode, we use DeepSeek's R1-distilled counterparts and Qwen's QwQ model.
Overall Performance Benchmarks
*Figure: benchmark results showing SAGE Reasoning 3B performance across multiple evaluation metrics.*
LiveBench Global Average
*Figure: LiveBench global-average comparison against size-equivalent models, with SAGE models scoring consistently higher.*
Tool Calling Performance
*Figure: tool-calling comparison showing enhanced performance in function calling and tool utilization.*
Usage
Here is a snippet for usage with Transformers:
```python
import transformers
import torch

model_id = "sagea-ai/sage-reasoning-8b"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Give me a short introduction to LLMs."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)
print(outputs[0]["generated_text"][-1])
```
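When you pass a list of chat messages, the pipeline returns the full conversation, so `outputs[0]["generated_text"][-1]` is the assistant's reply as a role/content dict; use `outputs[0]["generated_text"][-1]["content"]` if you want only the reply text.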
Implementing extended thinking
- By default, the model answers in standard mode.
- To enable thinking, use either of the two methods:
  - Add a specific system prompt, or
  - Set `enable_thinking=True` while applying the chat template.

NOTE: For the SAGE Reasoning 8B model, we suggest using `repetition_penalty=1.1` while implementing extended thinking.
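`repetition_penalty` is a standard `transformers` generation argument, so (as a minimal sketch, reusing `pipeline` and `messages` from the snippet above) it can be passed directly to the pipeline call:

```python
outputs = pipeline(
    messages,
    max_new_tokens=512,
    repetition_penalty=1.1,  # suggested for sage-reasoning-8b in extended thinking mode
)
```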
Method 1 - Add a specific system prompt.
To enable thinking, simply use this as the system prompt: `system_instruction = 'Enable deep thinking subroutine.'`
If you already have a `system_instruction`, then use `system_instruction = 'Enable deep thinking subroutine.' + '\n\n' + system_instruction`.
Here is an example:
```python
import transformers
import torch

model_id = "sagea-ai/sage-reasoning-8b"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION},
    {"role": "user", "content": "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)
print(outputs[0]["generated_text"][-1])
```
Similarly, if you already have a system prompt, you can prepend DEEP_THINKING_INSTRUCTION to it like this:
```python
DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

system_prompt = "Reply to each prompt with only the actual code - no explanations."
prompt = "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION + '\n\n' + system_prompt},
    {"role": "user", "content": prompt}
]
```
Method 2 - Set enable_thinking=True in the tokenizer
If you are using Hugging Face tokenizers, you can simply add the argument `enable_thinking=True` when applying the chat template (the option is built into the chat template).
Here is an example:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sagea-ai/sage-reasoning-8b"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to LLMs."
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens so only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
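In thinking mode, `response` contains the model's reasoning trace followed by the final answer. As a hypothetical post-processing sketch, assuming the trace is wrapped in `<think>...</think>` tags (a common convention; check the model's chat template for the actual delimiters), you could split the two like so:

```python
import re

# Assumption: reasoning is delimited by <think>...</think>; verify against
# the model's chat template before relying on this.
match = re.match(r"(?s)\s*<think>(.*?)</think>\s*(.*)", response)
if match:
    thinking, answer = match.group(1).strip(), match.group(2)
else:
    thinking, answer = None, response  # no trace found; treat it all as the answer
print(answer)
```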
Tool Calling
SAGE Reasoning models support tool calling (single, parallel, multiple, and parallel_multiple) in both standard and extended thinking modes.
Here is a snippet:
```python
# First, define a tool
def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
    Returns:
        The current temperature at the specified location, as a float.
    """
    return 22.  # A real function should probably actually get the temperature!

# Next, create a chat and apply the chat template
messages = [
    {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
]

text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
print(output_text)
```
This will result in the output:

```
<tool_call>
{"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
</tool_call><|eot_id|>
```
If the model generates a tool call, you should add it to the chat as an assistant turn like so:
```python
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
```
and then call the tool and append the result, with the tool role, like so:
```python
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
```
After that, you can call `generate()` again to let the model use the tool result in the chat:
```python
text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
```
This should result in `output_text` being the string:

```
'The current temperature in Paris is 22.0 degrees.<|eot_id|>'
```
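For completeness, here is a minimal sketch of the glue between the two generation calls. It assumes the `<tool_call>...</tool_call>` output format shown above; the `TOOLS` registry and `run_tool_call` helper are illustrative names, not part of the model's API:

```python
import json
import re

# Hypothetical registry mapping tool names to local Python functions.
TOOLS = {"get_current_temperature": get_current_temperature}

def run_tool_call(output_text, messages):
    """Parse a <tool_call> block, execute the tool, and append both turns to the chat."""
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", output_text, re.DOTALL)
    if match is None:
        return False  # no tool call; output_text is an ordinary reply
    tool_call = json.loads(match.group(1))
    result = TOOLS[tool_call["name"]](**tool_call["arguments"])
    messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
    messages.append({"role": "tool", "name": tool_call["name"], "content": str(result)})
    return True  # caller should re-apply the chat template and generate again
```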
License
This repository and the model weights are licensed under the Llama 3.2 Community License Agreement (the default license agreement for Llama models).
Contact
Get in Touch with Our Team
For inquiries, collaborations, or support, please reach out to us:
Email: founders@sagea.space
SAGE Reasoning 8B
Advancing the frontier of hybrid reasoning models