English–Tigrinya Machine Translation & Tokenizer

📌 Conference

Accepted at the 3rd International Conference on Foundation and Large Language Models (FLLM2025)
📍 25–28 November 2025 | Vienna, Austria

Paper Title: Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks


📝 Model Summary

This repository provides a custom tokenizer and a fine-tuned MarianMT model for English ↔ Tigrinya machine translation.
It leverages the NLLB dataset for training and OPUS parallel corpora for testing and evaluation, with BLEU used as the primary metric.

  • Languages: English (eng), Tigrinya (tir)
  • Tokenizer: SentencePiece, customized for Ge'ez-script representation
  • Model: MarianMT (multilingual transformer) fine-tuned for English–Tigrinya translation
  • License: MIT

🔍 Model Details

Tokenizer

  • Type: SentencePiece-based subword tokenizer
  • Purpose: Handles Ge'ez-script-specific subword tokenization for Tigrinya
  • Training Data: NLLB English–Tigrinya subset
  • Evaluation Data: OPUS parallel corpus
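
The exact tokenizer training configuration is not included in this repository, but the sketch below shows how a comparable SentencePiece tokenizer could be trained with the `sentencepiece` library. The corpus path, vocabulary size, and model type are illustrative assumptions, not the settings used for the released tokenizer.

```python
import sentencepiece as spm

# Hypothetical corpus file: one sentence per line, drawn from the
# NLLB English–Tigrinya subset. character_coverage=1.0 keeps every
# Ge'ez codepoint in the vocabulary, which matters for scripts with
# many distinct characters.
spm.SentencePieceTrainer.train(
    input="nllb_eng_tir.txt",     # assumed path, not shipped with this repo
    model_prefix="tig_eng_spm",   # writes tig_eng_spm.model / tig_eng_spm.vocab
    vocab_size=32000,             # illustrative value
    model_type="unigram",         # SentencePiece's default subword algorithm
    character_coverage=1.0,
)
```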

Translation Model

  • Base Model: MarianMT
  • Parameters: ~76.5M (F32, safetensors)
  • Frameworks: Hugging Face Transformers, PyTorch
  • Task: Bidirectional English ↔ Tigrinya MT

⚙️ Training Details

  • Training Dataset: NLLB Parallel Corpus (English ↔ Tigrinya)
  • Testing Dataset: OPUS Parallel Corpus
  • Epochs: 3
  • Batch Size: 8
  • Max Sequence Length: 128 tokens
  • Learning Rate: 1.44e-07 with decay
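
For reference, the hyperparameters above map onto a Hugging Face `Seq2SeqTrainer` configuration roughly as follows. This is a sketch, not the exact training script: dataset loading and preprocessing are omitted, and the output directory name is an assumption.

```python
from transformers import (
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "Hailay/MachineT_TigEng"  # or the MarianMT base checkpoint used for fine-tuning
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

args = Seq2SeqTrainingArguments(
    output_dir="marianmt-eng-tir",   # assumed name
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=1.44e-7,
    lr_scheduler_type="linear",      # "with decay" in the summary above
    predict_with_generate=True,
    generation_max_length=128,       # max sequence length from the summary
)

# train_dataset / eval_dataset would hold the tokenized NLLB and OPUS
# splits (truncated to 128 tokens); trainer.train() would then start
# fine-tuning once they are attached.
trainer = Seq2SeqTrainer(model=model, args=args, tokenizer=tokenizer)
```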

Training Loss

  • Epoch 1: 0.443
  • Epoch 2: 0.4077
  • Epoch 3: 0.4379
  • Final Loss: 0.4756

Gradient Norms

  • Epoch 1: 1.14
  • Epoch 2: 1.11
  • Epoch 3: 1.06

Performance

  • Training Time: ~12 hours (43,376.7s)
  • Speed: 96.7 samples/sec | 12.08 steps/sec

📊 Evaluation

  • Metric: BLEU score
  • Evaluation Dataset: OPUS parallel English–Tigrinya
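
BLEU can be reproduced with the `sacrebleu` package. The sketch below assumes lists of model outputs and reference translations from the OPUS test split; the variable contents are toy placeholders.

```python
import sacrebleu

# hypotheses: model translations; references: gold OPUS translations
# (one inner list per reference set, aligned with the hypotheses).
hypotheses = ["ሰላም ዓለም"]
references = [["ሰላም ዓለም"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```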

🚀 Usage

This model can be directly used for English → Tigrinya and Tigrinya → English translation.

Example (Python)

```python
from transformers import MarianMTModel, MarianTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_name = "Hailay/MachineT_TigEng"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Translate English → Tigrinya
english_text = "We must obey the Lord and leave them alone"
inputs = tokenizer(english_text, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**inputs)
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)

print("Translated text:", translated_text)
```
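
For batched input, the same call works on a list of sentences, and beam search via `num_beams` typically improves low-resource MT output. The generation settings below are illustrative, not tuned values from the paper.

```python
# Batch translation with beam search (illustrative settings)
sentences = [
    "Good morning.",
    "Where is the nearest hospital?",
]
batch = tokenizer(sentences, return_tensors="pt", padding=True,
                  truncation=True, max_length=128)
outputs = model.generate(**batch, num_beams=4, max_length=128)
for src, out in zip(sentences, outputs):
    print(src, "→", tokenizer.decode(out, skip_special_tokens=True))
```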



📌 Citation

If you use this model or tokenizer in your work, please cite:

```bibtex
@inproceedings{hailay2025lowres,
  title     = {Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks},
  author    = {Hailay Kidu and collaborators},
  booktitle = {Proceedings of the 3rd International Conference on Foundation and Large Language Models (FLLM2025)},
  year      = {2025},
  address   = {Vienna, Austria}
}
```