🌍 Multilingual NMT with Knowledge Distillation using FLORES-101

🚀 Project Overview

This project explores Multilingual Neural Machine Translation (NMT) through Knowledge Distillation using the FLORES-101 dataset for training and evaluation. The goal is to enable high-quality, bidirectional translation among:

  • 5 Indian Languages: Hindi, Tamil, Telugu, Kannada, Malayalam
  • 5 Global Languages: English, French, German, Spanish, Japanese

Each language is translated to and from every other language, yielding 90 translation directions (10 × 9).


🧠 Methodology

Teacher Model:

  • NLLB (facebook/nllb-200-distilled-600M)
    A strong multilingual model capable of translating between 200+ languages.
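
The snippet below is a minimal sketch of how the NLLB teacher can be queried with the Hugging Face `transformers` API; the example sentence and the English→Tamil direction are only illustrative.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the NLLB teacher; src_lang tells the tokenizer which language tag to prepend.
teacher_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(teacher_name, src_lang="eng_Latn")
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_name)

# Illustrative sentence; decoding is forced to start with the Tamil language tag.
inputs = tokenizer("The weather is pleasant today.", return_tensors="pt")
generated = teacher.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("tam_Taml"),
    max_length=128,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```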

Student Models:

  • mBART (facebook/mbart-large-50-many-to-many-mmt)
  • IndicBART (ai4bharat/indicbart)

Distillation Strategy:

The teacher model generates translations for every source sentence in each language pair, and the student models are trained to reproduce these outputs. This reduces model size while preserving translation quality.
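
As a rough sketch of this strategy (sequence-level distillation on the teacher's outputs), the loop below fine-tunes the mBART student on hypothetical (source, teacher translation) pairs; the pair list, language codes, and hyperparameters are illustrative, not the exact setup used for this model.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

student_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = AutoTokenizer.from_pretrained(student_name, src_lang="en_XX", tgt_lang="ta_IN")
student = AutoModelForSeq2SeqLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=3e-5)

# Hypothetical distilled pairs: English sources with the teacher's Tamil outputs.
distilled_pairs = [
    ("The weather is pleasant today.", "இன்று வானிலை இதமாக உள்ளது."),
    # ... one entry per sentence in each of the 90 generated datasets
]

student.train()
for source, teacher_translation in distilled_pairs:
    # The teacher's translation is used as the target label, so the student
    # learns to mimic the teacher's output at the sequence level.
    batch = tokenizer(source, text_target=teacher_translation,
                      return_tensors="pt", truncation=True, max_length=128)
    loss = student(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```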


📘 Dataset: FLORES-101

FLORES-101 provides aligned sentences across 101 languages for translation evaluation. We use the devtest split to generate high-quality, consistent training pairs.

Languages & FLORES Codes:

| Language  | FLORES Code |
|-----------|-------------|
| English   | eng_Latn    |
| Hindi     | hin_Deva    |
| Tamil     | tam_Taml    |
| Telugu    | tel_Telu    |
| Kannada   | kan_Knda    |
| Malayalam | mal_Mlym    |
| French    | fra_Latn    |
| German    | deu_Latn    |
| Spanish   | spa_Latn    |
| Japanese  | jpn_Jpan    |
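
If the data is pulled from the Hugging Face Hub, loading one aligned pair might look like the sketch below; the `facebook/flores` repository and its per-pair config names are an assumption here (the original FLORES-101 release uses plain three-letter codes), so adjust to whichever distribution you actually use.

```python
from datasets import load_dataset

# Assumption: the Hub's facebook/flores repo, which exposes one config per language pair
# (e.g. "eng_Latn-tam_Taml") with "dev" and "devtest" splits. Its loading script may
# require trust_remote_code=True on recent versions of the datasets library.
flores = load_dataset("facebook/flores", "eng_Latn-tam_Taml", trust_remote_code=True)

devtest = flores["devtest"]
print(devtest[0]["sentence_eng_Latn"])  # English side of the first aligned sentence
print(devtest[0]["sentence_tam_Taml"])  # Tamil side of the same sentence
```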

Data Generation

All possible translation directions (e.g., en→ta, ta→en, ta→hi, hi→ta) were created, resulting in 90 parallel datasets.
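
Enumerating the directions is straightforward from the ten FLORES codes in the table above; the snippet below only illustrates how the 90-entry pair list can be derived.

```python
from itertools import permutations

codes = [
    "eng_Latn", "hin_Deva", "tam_Taml", "tel_Telu", "kan_Knda",
    "mal_Mlym", "fra_Latn", "deu_Latn", "spa_Latn", "jpn_Jpan",
]

# Every ordered (source, target) combination of distinct languages: 10 * 9 = 90 directions.
directions = list(permutations(codes, 2))
print(len(directions))   # 90
print(directions[:2])    # [('eng_Latn', 'hin_Deva'), ('eng_Latn', 'tam_Taml')]
```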

