Model Card for Daemontatox/Zirel-2
Model Name: Daemontatox/Zirel-2
Author: Daemontatox (Ammar)
Base Model: Qwen/Qwen3-30B-A3B-Instruct-2507
Model Details
Zirel-2 is the latest iteration in the Zirel series of models developed by Daemontatox. It is a fine-tuned version of Qwen3-30B-A3B-Instruct-2507, a Mixture-of-Experts (MoE) model from Alibaba's Qwen team.
The primary goals of this fine-tuning are:
- Maximize Knowledge & Capability: To enhance the model's general knowledge and reasoning abilities, making it a highly capable personal assistant.
- Optimize for Efficiency: To leverage the inherent efficiency of the MoE architecture, ensuring high performance while keeping GPU memory usage and computational costs to a minimum.
 
The base model, Qwen3-30B-A3B-Instruct-2507, has 30.5 billion total parameters but activates only about 3.3 billion of them per token during inference. This design allows it to deliver performance comparable to much larger dense models while remaining significantly more resource-efficient.
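To make "active parameters" concrete, here is a toy sketch of top-k expert routing (purely illustrative, not the actual Qwen3 routing code): a small gating layer scores every expert for a token, and only the top-scoring few are actually executed. The expert counts used here match those listed under Architecture below.
import torch

# Toy illustration of sparse top-k MoE routing (not the real Qwen3 implementation).
num_experts = 128   # total experts in the base model
top_k = 8           # experts actually executed per token
hidden_size = 64    # toy hidden dimension for illustration

token_hidden = torch.randn(hidden_size)              # one token's hidden state
router = torch.nn.Linear(hidden_size, num_experts)   # gating network

scores = router(token_hidden)                 # one score per expert
top_scores, top_idx = scores.topk(top_k)      # keep only the k best experts
weights = torch.softmax(top_scores, dim=-1)   # mixing weights for those experts

print("Experts activated for this token:", sorted(top_idx.tolist()))
print(f"Fraction of experts used: {top_k / num_experts:.2%}")  # 6.25%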
Intended Use
Zirel-2 is designed to be a smart, versatile, and efficient personal assistant. It excels at tasks such as:
- Answering complex questions with detailed reasoning.
- Generating high-quality text, code, and creative content.
- Performing logical and mathematical reasoning.
- Following intricate, multi-step instructions.
 
Its efficiency makes it particularly suitable for deployment on consumer-grade hardware or in environments where computational resources are constrained.
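As a rough illustration of the memory side of that claim, the snippet below estimates the weight-only footprint of a ~30.5B-parameter model at different precisions. This is a back-of-the-envelope sketch: it ignores activations, the KV cache, and quantization overhead, so real usage will be higher.
# Back-of-the-envelope weight-only memory estimates for a ~30.5B-parameter model.
# Real usage is higher: KV cache, activations, and framework overhead are not included.
total_params = 30.5e9

bytes_per_param = {
    "bf16 / fp16": 2.0,
    "int8": 1.0,
    "nf4 (4-bit)": 0.5,
}

for precision, nbytes in bytes_per_param.items():
    gib = total_params * nbytes / 1024**3
    print(f"{precision:>12}: ~{gib:.0f} GiB of weights")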
Architecture
- Type: Mixture-of-Experts (MoE) Language Model.
- Total Parameters: ~30.5B.
- Active Parameters per Token: ~3.3B.
- Experts: The base model uses 128 experts, with 8 activated for any given token.
- Context Length: Supports a long context window of up to 262,144 tokens.
 
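If you want to check these numbers against the repository itself, the sketch below reads only the model configuration (no weights are downloaded). The attribute names follow the standard Qwen3-MoE config schema (num_experts, num_experts_per_tok, max_position_embeddings) and are assumptions about this checkpoint; adjust them if the config differs.
from transformers import AutoConfig

# Fetch just the configuration file, not the model weights
config = AutoConfig.from_pretrained("Daemontatox/Zirel-2")

# Field names assume the standard Qwen3-MoE config schema
print("Total experts:     ", getattr(config, "num_experts", "n/a"))
print("Experts per token: ", getattr(config, "num_experts_per_tok", "n/a"))
print("Context length:    ", getattr(config, "max_position_embeddings", "n/a"))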
Example Inference Code
You can run inference with Zirel-2 using the Hugging Face transformers library. For optimal performance with MoE models, it's recommended to use a library like vLLM or TensorRT-LLM, but a basic example with transformers is provided below.
Prerequisites:
pip install transformers accelerate torch bitsandbytes
Basic Inference with transformers:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

# Load the model and tokenizer
model_name = "Daemontatox/Zirel-2"

# 4-bit or 8-bit quantization is highly recommended to fit the model in memory.
# Here we use 4-bit NF4 quantization via bitsandbytes.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.float16,
)
# Create a text generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Define the conversation; the tokenizer's chat template will format it for Qwen models.
messages = [
    {"role": "system", "content": "You are a helpful and knowledgeable AI assistant named Zirel-2."},
    {"role": "user", "content": "Explain the concept of Mixture of Experts (MoE) in simple terms and why it's efficient."}
]
# Apply the chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Generate a response
outputs = pipe(prompt)
response = outputs[0]["generated_text"][len(prompt):]
print(response)
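If you would rather see tokens as they are generated instead of waiting for the full completion, transformers provides a TextStreamer that plugs into model.generate. A minimal variant of the example above (reusing the model, tokenizer, and prompt already defined) could look like this:
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are produced, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
_ = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    streamer=streamer,
)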
For Better Performance (Recommended):
For serious use, consider using vLLM, which has excellent support for MoE models and provides much higher throughput and lower latency.
# Example using vLLM (you need to install vLLM first: pip install vllm)
from vllm import LLM, SamplingParams
# Initialize the LLM engine
llm = LLM(model="Daemontatox/Zirel-2", tensor_parallel_size=1) # Adjust tensor_parallel_size for your GPU setup
# Create a sampling params object
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)
# Prepare your prompt using the chat template
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Daemontatox/Zirel-2")
messages = [
    {"role": "system", "content": "You are a helpful and knowledgeable AI assistant named Zirel-2."},
    {"role": "user", "content": "Explain the concept of Mixture of Experts (MoE) in simple terms and why it's efficient."}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Generate text
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
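Recent vLLM releases also expose an LLM.chat() helper that applies the chat template internally; if your installed version supports it, the manual apply_chat_template step above can be skipped. A minimal sketch, assuming that API is available in your vLLM version:
# Assumes a vLLM version that provides LLM.chat(); otherwise use the
# manual apply_chat_template approach shown above.
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)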