---
library_name: transformers
tags:
- MoroccanArabic
- Darija
- GemMaroc
- conversational
- qwen
pipeline_tag: text-generation
datasets:
- GemMaroc/TULU-3-50k-darija-english
language:
- ar
- ary
- en
base_model:
- Qwen/Qwen2.5-32B-Instruct
---

# Qwen2.5-32B-Instruct-darija

Unlocking **Moroccan Darija** proficiency in a state-of-the-art large language model, trained with a _minimal-data, green-AI_ recipe that preserves Qwen2.5-32B-Instruct's exceptional reasoning abilities while adding fluent Darija generation.

---

## Model at a glance

|                     | Details |
| ------------------- | ------------------------------------------------------------------------------------------------------ |
| **Model ID**        | `GemMaroc/Qwen2.5-32B-Instruct-darija` |
| **Base model**      | [`Qwen/Qwen2.5-32B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) |
| **Architecture**    | Decoder-only Transformer (Qwen2.5) |
| **Parameters**      | 32 billion |
| **Context length**  | 32,768 tokens |
| **Training regime** | Supervised fine-tuning (LoRA, merged into the base weights) on a 50 K slice of high-quality Darija/English TULU instructions |
| **License**         | Apache 2.0 |

---

## Why another Darija model?

- **Inclusive AI** More than 36 million speakers of Moroccan Arabic remain underserved by open LLMs.
- **Quality over quantity** A carefully curated 50 K instruction set surfaces Darija competence without sacrificing cross-lingual reasoning.
- **Green AI** Qwen2.5-32B-Instruct-darija reaches state-of-the-art Darija scores with a minimal-data recipe that keeps training compute, and hence energy use, low.
- **Premium performance** 32 B parameters provide the highest-quality Darija generation and reasoning capabilities.

---

## Benchmark summary

### Darija Benchmarks

| Model                           | Darija MMLU | Darija HellaSwag | Sentiment Analysis | GSM8K Darija | Summarization (chrF) | ROUGE-1 | ROUGE-L | BERTScore |
| ------------------------------- | ----------- | ---------------- | ------------------ | ------------ | -------------------- | ------- | ------- | --------- |
| Qwen2.5-32B-Instruct            | 63.9 %      | 45.9 %           | 65.8 %             | 78.1 %       | 27.2                 | 9.4     | 9.2     | 37.2      |
| **Qwen2.5-32B-Instruct-darija** | 59.8 %      | **53.1 %**       | 64.8 %             | **84.1 %**   | **27.4**             | 7.5     | 7.3     | **38.9**  |

### English Benchmarks

| Model                           | MMLU       | TruthfulQA | HellaSwag  | GSM8K @5   | GSM8K Gen |
| ------------------------------- | ---------- | ---------- | ---------- | ---------- | --------- |
| Qwen2.5-32B-Instruct            | 73.9 %     | 70.5 %     | 74.0 %     | 78.6 %     | 90.5 %    |
| **Qwen2.5-32B-Instruct-darija** | **77.8 %** | 60.1 %     | **79.9 %** | **79.8 %** | 90.0 %    |

Zero-shot accuracy; full table in the paper.
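The multiple-choice scores above are zero-shot accuracies. A common way to compute such accuracies for a causal LM is to score each candidate answer by its log-likelihood under the model and pick the most likely one; the sketch below illustrates that approach with the `transformers` API. The placeholder question, choices, and prompt formatting are assumptions for illustration, not the exact harness behind the table.

```python
# Illustrative zero-shot multiple-choice scoring by answer log-likelihood.
# The question, choices, and prompt formatting are placeholders, not the
# exact evaluation setup used for the numbers above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GemMaroc/Qwen2.5-32B-Instruct-darija"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
model.eval()

question = "..."                        # a Darija multiple-choice question (placeholder)
choices = ["...", "...", "...", "..."]  # its candidate answers (placeholders)

def answer_nll(question: str, choice: str) -> float:
    """Mean negative log-likelihood of the answer tokens given the question."""
    prompt_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(question + " " + choice, return_tensors="pt").input_ids.to(model.device)
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100  # ignore the question tokens in the loss
    # Token-boundary effects at the question/answer seam are ignored in this sketch.
    with torch.no_grad():
        return model(full_ids, labels=labels).loss.item()

# Predict the choice whose continuation the model finds most likely (lowest NLL).
scores = [answer_nll(question, c) for c in choices]
prediction = choices[scores.index(min(scores))]
print(prediction)
```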
---

## Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "GemMaroc/Qwen2.5-32B-Instruct-darija"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# The model is already dispatched across devices by device_map above,
# so the pipeline only needs the generation settings.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    temperature=0.7,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)

# Roughly: "What is the 'butterfly effect' theory? Explain it in Darija and give a simple example."
messages = [
    {"role": "user", "content": "شنو هي نظرية 'butterfly effect'؟ فسّرها بدارجة ونقّط مثال بسيط."}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(pipe(prompt)[0]["generated_text"][len(prompt):])
```

### Chat template (Qwen2.5 format)

The tokenizer ships with a baked-in Jinja chat template in the Qwen2.5 ChatML style: every turn is wrapped in `<|im_start|>{role}` … `<|im_end|>` markers. When you set `add_generation_prompt=True`, the rendered prompt ends right after the opening assistant tag so the model can continue:

```
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
```

The assistant keeps generating tokens until it emits `<|im_end|>`.

```python
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
```

No manual token juggling is required: the call above handles the special tokens, turn delimiters, and newline placement automatically.

---

Pre-quantised checkpoints will be published under the same repo tags (`qwen2.5-32b-darija-awq-int4`, `qwen2.5-32b-darija-gguf-q4_k_m`).

---

## Training recipe (one-paragraph recap)

1. **Data** Translate a 44 K reasoning slice of TULU 50K into Darija, keeping 20 % English for cross-lingual robustness.
2. **LoRA SFT** Rank 16, α = 32, 3 epochs, bf16, context 32,768.
3. **Merge & push** Merge the LoRA adapter into the base weights (`peft.merge_and_unload`), convert to safetensors, and upload (see the sketch in the appendix at the end of this card).

---

## Limitations & ethical considerations

- Sentiment analysis and abstractive summarisation still trail the state of the art.
- The tokeniser is unchanged; rare Darija spellings may fragment into many sub-word tokens.
- The model may inherit societal biases present in the pre-training data.
- No RLHF / RLAIF safety alignment yet; apply a moderation layer in production.

---

## Citation

If you use Qwen2.5-32B-Instruct-darija in your work, please cite:

```bibtex
@misc{skiredj2025gemmarocunlockingdarijaproficiency,
      title={GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data},
      author={Abderrahman Skiredj and Ferdaous Azhari and Houdaifa Atou and Nouamane Tazi and Ismail Berrada},
      year={2025},
      eprint={2505.17082},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.17082},
}
```
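---

## Appendix: merging the LoRA adapter (sketch)

As referenced in step 3 of the training recipe, the merge-and-push step can be reproduced roughly as follows. This is a sketch, not the exact script used for the released checkpoint: the adapter path, output directory, and dtype choice are placeholders/assumptions.

```python
# Sketch of the merge-and-push step; paths and settings are placeholders,
# not the exact script used for this release.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-32B-Instruct"
adapter_path = "path/to/darija-lora-adapter"       # placeholder: local LoRA checkpoint
output_dir = "qwen2.5-32b-instruct-darija-merged"  # placeholder: local output directory

# Load the base model in bf16 (matching the bf16 SFT setting) and attach the adapter.
# Merging a 32 B model requires enough memory to hold the full bf16 weights.
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()

# Save as safetensors together with the tokenizer, then (optionally) upload.
tokenizer = AutoTokenizer.from_pretrained(base_id)
merged.save_pretrained(output_dir, safe_serialization=True)
tokenizer.save_pretrained(output_dir)
# merged.push_to_hub("GemMaroc/Qwen2.5-32B-Instruct-darija")  # requires write access
```

After `merge_and_unload()`, the low-rank updates are folded into the base weight matrices, so the merged checkpoint loads like a regular Qwen2.5 model and needs no `peft` dependency at inference time.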