# Multilingual NMT with Knowledge Distillation using FLORES-101

## Project Overview
This project explores Multilingual Neural Machine Translation (NMT) through Knowledge Distillation using the FLORES-101 dataset for training and evaluation. The goal is to enable high-quality, bidirectional translation among:
- 5 Indian Languages: Hindi, Tamil, Telugu, Kannada, Malayalam
- 5 Global Languages: English, French, German, Spanish, Japanese
Each language is translated to and from every other language, yielding 90 translation directions (10 × 9 ordered source-target pairs).
## Methodology
Teacher Model:
- NLLB (facebook/nllb-200-distilled-600M)
A strong multilingual model capable of translating between 200+ languages.
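A minimal sketch of running the teacher for one direction (English to Hindi) with Hugging Face `transformers` is shown below; the language codes follow the FLORES table further down, and generation settings such as `max_length` are illustrative assumptions rather than the project's exact configuration.

```python
# A minimal sketch of teacher inference, assuming the
# facebook/nllb-200-distilled-600M checkpoint from the Hugging Face Hub.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

teacher_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(teacher_name, src_lang="eng_Latn")
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_name)

inputs = tokenizer("The weather is nice today.", return_tensors="pt")
outputs = teacher.generate(
    **inputs,
    # Force decoding into the target language (Hindi here); the codes
    # match the FLORES table below.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("hin_Deva"),
    max_length=128,  # illustrative setting, not the project's exact value
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```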
Student Models:
- mBART (facebook/mbart-large-50-many-to-many-mmt)
- IndicBART (ai4bharat/indicbart)
Distillation Strategy:
The teacher model generates translations for every language direction, and the student models are fine-tuned to reproduce these outputs (sequence-level knowledge distillation). This reduces model size while largely preserving translation quality.
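Below is a minimal sketch of the distillation step for the mBART student, assuming the teacher's translations have already been collected into (source, teacher translation) pairs; the toy `distill_pairs` dataset and the training hyperparameters are illustrative assumptions, not the project's exact setup.

```python
# A minimal sketch of sequence-level distillation for the mBART student.
# The toy `distill_pairs` dataset and the hyperparameters are assumptions.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

student_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = AutoTokenizer.from_pretrained(student_name, src_lang="en_XX", tgt_lang="hi_IN")
student = AutoModelForSeq2SeqLM.from_pretrained(student_name)

# Toy stand-in for the real distillation data: source sentences paired with
# the teacher's generated translations (one illustrative row shown).
distill_pairs = Dataset.from_dict({
    "src_text": ["The weather is nice today."],
    "teacher_translation": ["आज मौसम अच्छा है।"],
})

def preprocess(batch):
    # The student learns to reproduce the teacher's outputs ("hard-label" KD).
    return tokenizer(batch["src_text"], text_target=batch["teacher_translation"],
                     truncation=True, max_length=128)

tokenized = distill_pairs.map(preprocess, batched=True,
                              remove_columns=distill_pairs.column_names)

trainer = Seq2SeqTrainer(
    model=student,
    args=Seq2SeqTrainingArguments(output_dir="mbart-student",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3,
                                  learning_rate=3e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=student),
    tokenizer=tokenizer,
)
trainer.train()
```

The same recipe applies to the IndicBART student by swapping in the ai4bharat/indicbart checkpoint and its language conventions.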
## Dataset: FLORES-101

FLORES-101 provides sentence-aligned parallel data across 101 languages for translation evaluation. We use the devtest split to generate high-quality, consistent training pairs (see the loading sketch after the table below).
### Languages & FLORES Codes
| Language | Code |
|---|---|
| English | eng_Latn |
| Hindi | hin_Deva |
| Tamil | tam_Taml |
| Telugu | tel_Telu |
| Kannada | kan_Knda |
| Malayalam | mal_Mlym |
| French | fra_Latn |
| German | deu_Latn |
| Spanish | spa_Latn |
| Japanese | jpn_Jpan |
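A minimal sketch of loading aligned devtest sentences for one language pair is shown below, assuming the `facebook/flores` dataset on the Hugging Face Hub (it uses the same language codes as the table above); recent `datasets` versions may require `trust_remote_code=True` for script-based datasets.

```python
# A minimal sketch of loading aligned FLORES devtest sentences for one
# language pair; the facebook/flores Hub dataset is assumed here.
from datasets import load_dataset

src = load_dataset("facebook/flores", "eng_Latn", split="devtest", trust_remote_code=True)
tgt = load_dataset("facebook/flores", "hin_Deva", split="devtest", trust_remote_code=True)

# FLORES sentences are aligned by index, so row i of each split forms a parallel pair.
pairs = list(zip(src["sentence"], tgt["sentence"]))
print(len(pairs), pairs[0])
```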
### Data Generation
All possible translation directions (e.g., en→ta, ta→en, ta→hi, hi→ta) were created, resulting in 90 parallel datasets.
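As a small sketch, the 90 ordered directions can be enumerated directly from the code table above:

```python
# Enumerate all ordered (source, target) directions over the 10 languages.
from itertools import permutations

flores_codes = {
    "English": "eng_Latn", "Hindi": "hin_Deva", "Tamil": "tam_Taml",
    "Telugu": "tel_Telu", "Kannada": "kan_Knda", "Malayalam": "mal_Mlym",
    "French": "fra_Latn", "German": "deu_Latn", "Spanish": "spa_Latn",
    "Japanese": "jpn_Jpan",
}

directions = list(permutations(flores_codes.values(), 2))
print(len(directions))  # 90 directed pairs
```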