---
license: mit
datasets:
- jhu-clsp/mmbert-decay
- jhu-clsp/mmbert-midtraining
- jhu-clsp/mmbert-pretrain-p1-fineweb2-langs
- jhu-clsp/mmbert-pretrain-p2-fineweb2-remaining
- jhu-clsp/mmbert-pretrain-p3-others
pipeline_tag: fill-mask
---

# mmBERT: A Modern Multilingual Encoder

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Paper](https://img.shields.io/badge/Paper-Arxiv-red)](https://arxiv.org/abs/2509.06888)
[![Model](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-blue)](https://huggingface.co/jhu-clsp/mmBERT-base)
[![Collection](https://img.shields.io/badge/🤗%20Model%20Collection-blue)](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
[![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/jhu-clsp/mmBERT)

> TL;DR: A state-of-the-art multilingual encoder trained on 3T+ tokens across 1800+ languages, introducing novel techniques for learning low-resource languages during the decay phase.

mmBERT is a modern multilingual encoder that significantly outperforms previous generation models like XLM-R on classification, embedding, and retrieval tasks. Built on the ModernBERT architecture with novel multilingual training innovations, mmBERT demonstrates that low-resource languages can be effectively learned during the decay phase of training. It is also significantly faster than any previous multilingual encoder.

## Table of Contents
- [Quick Start](#quick-start)
- [Model Description](#model-description)
- [Novel Training Innovations](#novel-training-innovations)
- [Model Family](#model-family)
- [Training Data](#training-data)
- [Model Architecture](#model-architecture)
- [Usage Examples](#usage-examples)
- [Fine-tuning Examples](#fine-tuning-examples)
- [Citation](#citation)

## Quick Start

### Installation
```bash
# Quote the specifiers so the shell does not treat ">" as a redirect.
pip install "torch>=1.9.0"
# The ModernBERT architecture used by mmBERT requires transformers 4.48.0 or newer.
pip install "transformers>=4.48.0"
```

### Usage
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
model = AutoModel.from_pretrained("jhu-clsp/mmBERT-base")

# Encode one sentence and run it through the encoder
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)
```

## Model Description

mmBERT represents the first significant advancement over XLM-R for massively multilingual encoder models. Key features include:

1. **Massive Language Coverage** - Trained on over 1800 languages with a progressive inclusion strategy
2. **Modern Architecture** - Built on the ModernBERT foundation with Flash Attention 2 and unpadding techniques
3. **Novel Training Recipe** - Introduces inverse mask scheduling and temperature sampling
4. **Open Training Data** - Complete 3T+ token dataset publicly available
5. **Decay Phase Innovation** - Demonstrates effective learning of low-resource languages in the final training phase

The model uses bidirectional attention with a masked language modeling objective, optimized specifically for multilingual understanding and cross-lingual transfer.

## Novel Training Innovations

**Progressive Language Addition**: Start with 60 high-resource languages, expand to 110 languages by adding mid-resource ones, then include all 1833 languages in the decay phase.

**Inverse Mask Schedule**: Reduce the mask ratio from 30% → 15% → 5% across training phases for progressively refined learning.

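As a rough illustration of the two schedules above, the sketch below pairs each training phase with its language count and mask ratio, then masks a dummy sequence at that ratio. The phase names, the assignment of one mask ratio per phase, the `mask_token_id=4` value, and the `mask_tokens` helper are assumptions made for this example. It also replaces every selected position with the mask token, which is a simplification of full MLM corruption and not the training code.

```python
import random

# Language counts and mask ratios come from the text above.
# The phase names and the one-ratio-per-phase mapping are assumptions.
PHASES = {
    "pretraining": {"languages": 60, "mask_ratio": 0.30},
    "midtraining": {"languages": 110, "mask_ratio": 0.15},
    "decay": {"languages": 1833, "mask_ratio": 0.05},
}


def mask_tokens(token_ids, mask_token_id, mask_ratio, rng):
    """Mask a fixed fraction of positions (hypothetical helper).

    Returns (masked_ids, labels). Labels are -100 at unmasked positions,
    the ignore index commonly used with cross-entropy losses.
    """
    n_mask = max(1, round(len(token_ids) * mask_ratio))
    positions = set(rng.sample(range(len(token_ids)), n_mask))
    masked = [mask_token_id if i in positions else t for i, t in enumerate(token_ids)]
    labels = [t if i in positions else -100 for i, t in enumerate(token_ids)]
    return masked, labels


rng = random.Random(0)
tokens = list(range(1000, 1100))  # 100 dummy token ids
for phase, cfg in PHASES.items():
    masked, _ = mask_tokens(tokens, mask_token_id=4, mask_ratio=cfg["mask_ratio"], rng=rng)
    n_masked = sum(t == 4 for t in masked)
    print(f"{phase}: {cfg['languages']} languages, {n_masked} of {len(tokens)} tokens masked")
```

Lower mask ratios in later phases leave more context visible per sequence, which is one way to read "progressively refined learning".
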

**Inverse Temperature Sampling**: Shift multilingual sampling from a high-resource bias (τ=0.7) toward more uniform sampling (τ=0.3).

**Model Merging**: Combine English-focused, high-resource, and all-language decay variants using TIES merging.

## Model Family

| Model | Total Params | Non-embed Params | Languages | Download |
|:------|:-------------|:------------------|:----------|:---------|
| [mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small) | 140M | 42M | 1800+ | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/mmBERT-small) |
| [mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base) | 307M | 110M | 1800+ | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/mmBERT-base) |

## Training Data

mmBERT training data is publicly available across different phases:

| Phase | Dataset | Tokens | Description |
|:------|:--------|:-------|:------------|
| Pre-training P1 | [mmbert-pretrain-p1](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p1-fineweb2-langs) | 2.3T | 60 languages, foundational training |
| Pre-training P2 | [mmbert-pretrain-p2](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p2-fineweb2-remaining) | - | Extension data for pre-training phase |
| Pre-training P3 | [mmbert-pretrain-p3](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p3-others) | - | Final pre-training data |
| Mid-training | [mmbert-midtraining](https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining) | 600B | 110 languages, context extension to 8K |
| Decay Phase | [mmbert-decay](https://huggingface.co/datasets/jhu-clsp/mmbert-decay) | 100B | 1833 languages, premium quality |

**Data Sources**: Filtered DCLM (English), FineWeb2 (multilingual), FineWeb2-HQ (20 high-resource languages), Wikipedia (MegaWika), code repositories (StarCoder, ProLong), academic papers (ArXiv, PeS2o), and community discussions (StackExchange).

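To make the effect of the temperature values listed under Novel Training Innovations concrete, here is a minimal sketch assuming the common exponent form of temperature sampling, where language `i` with `n_i` tokens is drawn with probability proportional to `n_i ** tau`. This card does not spell out the formula, so the form, the `sampling_probs` helper, and the token counts are all illustrative assumptions.

```python
def sampling_probs(token_counts, tau):
    """Per-language sampling probabilities, assuming p_i ∝ n_i ** tau.

    tau = 1 reproduces the raw data proportions; tau = 0 is fully uniform.
    """
    weights = {lang: n ** tau for lang, n in token_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Hypothetical token counts for a high-, mid-, and low-resource language.
counts = {"en": 1_000_000, "de": 100_000, "is": 1_000}

for tau in (0.7, 0.3):
    probs = sampling_probs(counts, tau)
    print(tau, {lang: round(p, 3) for lang, p in probs.items()})
```

Under this assumed form, lowering τ from 0.7 to 0.3 moves probability mass from the largest languages to the smallest ones, which matches the shift from a high-resource bias toward more uniform sampling.
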
## Model Architecture

| Parameter | mmBERT-small | mmBERT-base |
|:----------|:-------------|:------------|
| Layers | 22 | 22 |
| Hidden Size | 384 | 768 |
| Intermediate Size | 1152 | 1152 |
| Attention Heads | 6 | 12 |
| Total Parameters | 140M | 307M |
| Non-embedding Parameters | 42M | 110M |
| Max Sequence Length | 8192 | 8192 |
| Vocabulary Size | 256,000 | 256,000 |
| Tokenizer | Gemma 2 | Gemma 2 |

## Usage Examples

### Masked Language Modeling

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/mmBERT-base")

def predict_masked_token(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Find the mask position(s) and return the top-5 candidates for the first one
    mask_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)
    predictions = outputs.logits[mask_indices]
    top_tokens = torch.topk(predictions, 5, dim=-1)
    return [tokenizer.decode(token) for token in top_tokens.indices[0]]

# Works across languages. The mask placeholder is read from the tokenizer
# so each sentence contains exactly one mask token.
mask = tokenizer.mask_token
texts = [
    f"The capital of France is {mask}.",
    f"La capital de España es {mask}.",
    f"Die Hauptstadt von Deutschland ist {mask}."
]

for text in texts:
    predictions = predict_masked_token(text)
    print(f"Text: {text}")
    print(f"Predictions: {predictions}")
```

### Cross-lingual Embeddings

```python
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
model = AutoModel.from_pretrained("jhu-clsp/mmBERT-base")

def get_embeddings(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Mean-pool over real tokens only, so padding does not skew the average
    hidden = outputs.last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return embeddings.numpy()

multilingual_texts = [
    "Artificial intelligence is transforming technology",
    "La inteligencia artificial está transformando la tecnología",
    "L'intelligence artificielle transforme la technologie",
    "人工智能正在改变技术"
]

embeddings = get_embeddings(multilingual_texts)
similarities = cosine_similarity(embeddings)
print("Cross-lingual similarity matrix:")
print(similarities)
```

## Fine-tuning Examples

### Dense Retrieval with Sentence Transformers

```python
import argparse
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.evaluation import TripletEvaluator
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=8e-5)
    parser.add_argument("--model_name", type=str, default="jhu-clsp/mmBERT-base")
    args = parser.parse_args()

    lr = args.lr
    model_name = args.model_name
    model_shortname = model_name.split("/")[-1]

    model = SentenceTransformer(model_name)

    dataset = load_dataset(
        "sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1",
        "triplet-hard",
        split="train",
    )
    dataset_dict = dataset.train_test_split(test_size=1_000, seed=12)
    train_dataset = dataset_dict["train"].select(range(1_250_000))
    eval_dataset = dataset_dict["test"]

    loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)
    run_name = f"{model_shortname}-DPR-{lr}"

    training_args = SentenceTransformerTrainingArguments(
        output_dir=f"output/{model_shortname}/{run_name}",
        num_train_epochs=1,
        per_device_train_batch_size=512,
        per_device_eval_batch_size=512,
        warmup_ratio=0.05,
        fp16=False,
        bf16=True,
        batch_sampler=BatchSamplers.NO_DUPLICATES,
        learning_rate=lr,
        save_strategy="steps",
        save_steps=500,
        save_total_limit=2,
        logging_steps=500,
        run_name=run_name,
    )

    dev_evaluator = TripletEvaluator(
        anchors=eval_dataset["query"],
        positives=eval_dataset["positive"],
        negatives=eval_dataset["negative"],
        name="msmarco-co-condenser-dev",
    )
    dev_evaluator(model)

    trainer = SentenceTransformerTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss=loss,
        evaluator=dev_evaluator,
    )
    trainer.train()

    model.save_pretrained(f"output/{model_shortname}/{run_name}/final")
    model.push_to_hub(run_name, private=False)

if __name__ == "__main__":
    main()
```

### Cross-lingual Classification
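
```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_score(labels, predictions, average='weighted')
    }

def flatten_languages(batch):
    # In the "all_languages" config every row holds all translations:
    # "premise" is a {language: text} dict and "hypothesis" has parallel
    # "language" and "translation" lists. Emit one row per language.
    flat = {"premise": [], "hypothesis": [], "label": []}
    for premise, hypothesis, label in zip(batch["premise"], batch["hypothesis"], batch["label"]):
        for lang, text in zip(hypothesis["language"], hypothesis["translation"]):
            flat["premise"].append(premise[lang])
            flat["hypothesis"].append(text)
            flat["label"].append(label)
    return flat

def main():
    model_name = "jhu-clsp/mmBERT-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=3
    )

    dataset = load_dataset("xnli", "all_languages")
    # The row count changes, so the original columns are dropped and rebuilt
    dataset = dataset.map(
        flatten_languages,
        batched=True,
        remove_columns=["premise", "hypothesis", "label"],
    )

    def tokenize_function(examples):
        texts = [f"{p} {tokenizer.sep_token} {h}" for p, h in zip(examples["premise"], examples["hypothesis"])]

        # Padding is applied per training batch by the data collator below
        return tokenizer(
            texts,
            truncation=True,
            max_length=512
        )

    train_dataset = dataset["train"].map(tokenize_function, batched=True)
    eval_dataset = dataset["validation"].map(tokenize_function, batched=True)

    training_args = TrainingArguments(
        output_dir="./mmbert-xnli",
        learning_rate=3e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=3,
        weight_decay=0.01,
        eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        greater_is_better=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=DataCollatorWithPadding(tokenizer),
        compute_metrics=compute_metrics,
    )

    trainer.train()

if __name__ == "__main__":
    main()
```
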
### Multilingual Reranking

```python
import logging
from datasets import load_dataset
from sentence_transformers.cross_encoder import (
    CrossEncoder,
    CrossEncoderModelCardData,
    CrossEncoderTrainer,
    CrossEncoderTrainingArguments,
)
from sentence_transformers.cross_encoder.evaluation import CrossEncoderNanoBEIREvaluator
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss
from sentence_transformers.util import mine_hard_negatives
from sentence_transformers import SentenceTransformer
import torch

def main():
    model_name = "jhu-clsp/mmBERT-base"
    train_batch_size = 32
    num_epochs = 2
    num_hard_negatives = 7

    model = CrossEncoder(
        model_name,
        model_card_data=CrossEncoderModelCardData(
            language="multilingual",
            license="mit",
        ),
    )

    full_dataset = load_dataset("sentence-transformers/gooaq", split="train").select(range(50_000))
    dataset_dict = full_dataset.train_test_split(test_size=1_000, seed=42)
    train_dataset = dataset_dict["train"]
    eval_dataset = dataset_dict["test"]

    embedding_model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2", device="cpu")
    hard_train_dataset = mine_hard_negatives(
        train_dataset,
        embedding_model,
        num_negatives=num_hard_negatives,
        margin=0,
        range_min=0,
        range_max=100,
        sampling_strategy="top",
        batch_size=2048,
        output_format="labeled-pair",
        use_faiss=True,
    )

    loss = BinaryCrossEntropyLoss(model=model, pos_weight=torch.tensor(num_hard_negatives))

    nano_beir_evaluator = CrossEncoderNanoBEIREvaluator(
        dataset_names=["msmarco", "nfcorpus", "nq"],
        batch_size=train_batch_size,
    )

    args = CrossEncoderTrainingArguments(
        output_dir="./mmbert-reranker",
        num_train_epochs=num_epochs,
        per_device_train_batch_size=train_batch_size,
        per_device_eval_batch_size=train_batch_size,
        learning_rate=2e-5,
        warmup_ratio=0.1,
        fp16=False,
        bf16=True,
        dataloader_num_workers=4,
        load_best_model_at_end=True,
        metric_for_best_model="eval_msmarco_ndcg@10",
        eval_strategy="steps",
        eval_steps=1000,
        save_strategy="steps",
        save_steps=1000,
        save_total_limit=2,
        logging_steps=200,
        seed=42,
    )

    trainer = CrossEncoderTrainer(
        model=model,
        args=args,
        train_dataset=hard_train_dataset,
        loss=loss,
        evaluator=nano_beir_evaluator,
    )
    trainer.train()

    model.save_pretrained("./mmbert-reranker/final")

if __name__ == "__main__":
    main()
```

## Training Data

mmBERT was trained on a carefully curated 3T+ token multilingual dataset:

| Phase | Tokens | Description |
|:------|:-------|:------------|
| [Pre-training P1](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p1-fineweb2-langs) | 2.3T tokens | 60 languages, diverse data mixture |
| [Pre-training P2](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p2-fineweb2-remaining) | - | Extension data for pre-training |
| [Pre-training P3](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p3-others) | - | Final pre-training data |
| [Mid-training](https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining) | 600B tokens | 110 languages, context extension |
| [Decay Phase](https://huggingface.co/datasets/jhu-clsp/mmbert-decay) | 100B tokens | 1833 languages, premium quality |

**Primary Sources:**
- **Filtered DCLM**: High-quality English content
- **FineWeb2**: Broad multilingual web coverage (1800+ languages)
- **FineWeb2-HQ**: Filtered subset of 20 high-resource languages
- **Code**: StarCoder and ProLong repositories
- **Academic**: ArXiv papers and PeS2o scientific content
- **Reference**: Wikipedia (MegaWika) and textbooks
- **Community**: StackExchange discussions

## Citation

If you use mmBERT in your research, please cite our work:

```bibtex
@misc{marone2025mmbertmodernmultilingualencoder,
      title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
      author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
      year={2025},
      eprint={2509.06888},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.06888},
}
```