---
license: mit
datasets:
- jhu-clsp/mmbert-decay
- jhu-clsp/mmbert-midtraining
- jhu-clsp/mmbert-pretrain-p1-fineweb2-langs
- jhu-clsp/mmbert-pretrain-p2-fineweb2-remaining
- jhu-clsp/mmbert-pretrain-p3-others
pipeline_tag: fill-mask
library_name: transformers
language:
- aai
- aak
- aau
- aaz
- aba
- abi
- abk
- abn
- abq
- abs
- abt
- abx
- aby
- abz
- aca
- acd
- ace
- acf
- ach
- acm
- acn
- acr
- acu
- ada
- ade
- adh
- adi
- adj
- adl
- ady
- adz
- aeb
- aer
- aeu
- aey
- afr
- agd
- agg
- agm
- agn
- agr
- agt
- agu
- agw
- agx
- aha
- ahk
- aia
- aii
- aim
- ain
- ajg
- aji
- ajz
- akb
- ake
- akh
- akp
- alj
- aln
- alp
- alq
- als
- alt
- aly
- alz
- ame
- amf
- amh
- ami
- amk
- amm
- amn
- amp
- amr
- amu
- amx
- ang
- anm
- ann
- anp
- anv
- any
- aoi
- aoj
- aom
- aoz
- apb
- apc
- ape
- apn
- apr
- apt
- apu
- apw
- apy
- apz
- arb
- are
- arg
- arl
- arn
- arp
- arq
- ars
- ary
- arz
- asg
- asm
- aso
- ast
- ata
- atb
- atd
- atg
- ati
- atj
- atq
- att
- auc
- aui
- auy
- ava
- avk
- avn
- avt
- avu
- awa
- awb
- awx
- ayo
- ayp
- ayr
- azb
- azg
- azj
- azz
- bak
- bam
- ban
- bao
- bar
- bas
- bav
- bba
- bbb
- bbc
- bbj
- bbk
- bbo
- bbr
- bch
- bci
- bcl
- bco
- bcw
- bdd
- bdh
- bdq
- bea
- bef
- bel
- bem
- ben
- beq
- bew
- bex
- bfd
- bfo
- bgr
- bgs
- bgt
- bgz
- bhg
- bhl
- bho
- bhp
- bhw
- bhz
- bib
- big
- bim
- bin
- bis
- biu
- biv
- bjn
- bjp
- bjr
- bjv
- bkd
- bkl
- bkq
- bku
- bkv
- bla
- blh
- blk
- blw
- blz
- bmh
- bmk
- bmq
- bmr
- bmu
- bmv
- bno
- bnp
- boa
- bod
- boj
- bom
- bon
- bos
- bov
- box
- bpr
- bps
- bpy
- bqc
- bqj
- bqp
- bre
- brh
- bru
- brx
- bsc
- bsn
- bsp
- bsq
- bss
- btd
- bth
- bts
- btt
- btx
- bud
- bug
- buk
- bul
- bum
- bus
- bvc
- bvd
- bvr
- bvz
- bwd
- bwi
- bwq
- bwu
- bxh
- bxr
- byr
- byv
- byx
- bzd
- bzh
- bzi
- bzj
- caa
- cab
- cac
- caf
- cag
- cak
- cao
- cap
- caq
- car
- cas
- cat
- cav
- cax
- cbc
- cbi
- cbk
- cbr
- cbs
- cbt
- cbu
- cbv
- cce
- cco
- ccp
- ceb
- ceg
- cek
- ces
- cfm
- cgc
- cgg
- cha
- chd
- che
- chf
- chj
- chk
- cho
- chq
- chr
- chu
- chv
- chw
- chz
- cjk
- cjo
- cjp
- cjs
- cjv
- ckb
- cko
- ckt
- cle
- clu
- cly
- cme
- cmn
- cmo
- cmr
- cnh
- cni
- cnk
- cnl
- cnt
- cnw
- coe
- cof
- cok
- con
- cop
- cor
- cos
- cot
- cou
- cpa
- cpb
- cpc
- cpu
- cpy
- crh
- crj
- crk
- crl
- crm
- crn
- crs
- crt
- crx
- csb
- csk
- cso
- csw
- csy
- cta
- ctd
- cto
- ctp
- ctu
- cub
- cuc
- cui
- cuk
- cul
- cut
- cux
- cwe
- cwt
- cya
- cym
- czt
- daa
- dad
- daf
- dag
- dah
- dak
- dan
- dar
- ddg
- ddn
- ded
- des
- deu
- dga
- dgc
- dgi
- dgr
- dgz
- dhg
- dhm
- dhv
- did
- dig
- dik
- diq
- dis
- diu
- div
- dje
- djk
- djr
- dks
- dln
- dng
- dnj
- dnw
- dob
- doi
- dop
- dos
- dow
- drg
- dru
- dsb
- dtb
- dtp
- dts
- dty
- dua
- due
- dug
- duo
- dur
- dwr
- dww
- dyi
- dyo
- dyu
- dzo
- ebk
- efi
- eka
- ekk
- eko
- ell
- emi
- eml
- emp
- enb
- enl
- enm
- enq
- enx
- epo
- eri
- ese
- esi
- esk
- ess
- esu
- eto
- etr
- etu
- eus
- eve
- ewe
- ewo
- ext
- eza
- faa
- fad
- fai
- fal
- fan
- fao
- far
- fas
- fat
- ffm
- fij
- fil
- fin
- fit
- fkv
- fmu
- fon
- for
- fra
- frd
- fro
- frp
- frr
- fry
- fub
- fud
- fue
- fuf
- fuh
- fuq
- fur
- fuv
- gaa
- gag
- gah
- gai
- gam
- gaw
- gaz
- gbi
- gbo
- gbr
- gcf
- gcr
- gde
- gdg
- gdn
- gdr
- geb
- gej
- gfk
- ghs
- gid
- gil
- giz
- gjn
- gkn
- gla
- gle
- glg
- glk
- glv
- gmh
- gmv
- gna
- gnb
- gnd
- gng
- gnn
- gnw
- goa
- gof
- gog
- goh
- gom
- gor
- gos
- got
- gqr
- grc
- grt
- gso
- gsw
- gub
- guc
- gud
- gug
- guh
- gui
- guj
- guk
- gul
- gum
- gun
- guo
- guq
- gur
- guu
- guw
- gux
- guz
- gvc
- gvf
- gvl
- gvn
- gwi
- gwr
- gya
- gym
- gyr
- hac
- hae
- hag
- hak
- hat
- hav
- haw
- hay
- hbo
- hch
- heb
- heg
- heh
- her
- hif
- hig
- hil
- hin
- hix
- hla
- hmo
- hmr
- hne
- hnj
- hnn
- hns
- hop
- hot
- hra
- hrv
- hrx
- hsb
- hto
- hub
- hui
- hun
- hus
- huu
- huv
- hvn
- hwc
- hye
- hyw
- ian
- iba
- ibg
- ibo
- icr
- ido
- idu
- ifa
- ifb
- ife
- ifk
- ifu
- ify
- ige
- ign
- ike
- ikk
- ikt
- ikw
- ilb
- ile
- ilo
- imo
- ina
- inb
- ind
- inh
- ino
- iou
- ipi
- iqw
- iri
- irk
- iry
- isd
- ish
- isl
- iso
- ita
- itv
- ium
- ivb
- ivv
- iws
- ixl
- izr
- izz
- jaa
- jac
- jae
- jam
- jav
- jbo
- jbu
- jic
- jiv
- jmc
- jpn
- jra
- jun
- jvn
- kaa
- kab
- kac
- kak
- kal
- kam
- kan
- kao
- kaq
- kas
- kat
- kaz
- kbc
- kbd
- kbh
- kbm
- kbo
- kbp
- kbq
- kbr
- kby
- kca
- kcg
- kck
- kdc
- kde
- kdh
- kdi
- kdj
- kdl
- kdr
- kea
- kei
- kek
- ken
- keo
- ker
- kew
- kez
- kff
- kgf
- kgk
- kgp
- kgr
- kha
- khk
- khm
- khs
- khz
- kia
- kij
- kik
- kin
- kir
- kiu
- kix
- kjb
- kje
- kjh
- kjs
- kkc
- kki
- kkj
- kkl
- kle
- klt
- klv
- kmb
- kmg
- kmh
- kmk
- kmm
- kmo
- kmr
- kms
- kmu
- kmy
- knc
- kne
- knf
- kng
- knj
- knk
- kno
- knv
- knx
- kny
- kog
- koi
- koo
- kor
- kos
- kpe
- kpf
- kpg
- kpj
- kpq
- kpr
- kpv
- kpw
- kpx
- kpz
- kqc
- kqe
- kqf
- kql
- kqn
- kqo
- kqp
- kqs
- kqw
- kqy
- krc
- kri
- krj
- krl
- kru
- krx
- ksb
- ksc
- ksd
- ksf
- ksh
- ksj
- ksp
- ksr
- kss
- ksw
- ktb
- ktj
- ktm
- kto
- ktu
- ktz
- kua
- kub
- kud
- kue
- kuj
- kum
- kup
- kus
- kvg
- kvj
- kvn
- kwd
- kwf
- kwi
- kwj
- kwn
- kwy
- kxc
- kxm
- kxw
- kyc
- kyf
- kyg
- kyq
- kyu
- kyz
- kze
- kzf
- kzj
- lac
- lad
- lai
- laj
- lam
- lao
- lap
- lat
- lbb
- lbe
- lbj
- lbk
- lcm
- lcp
- ldi
- ldn
- lee
- lef
- leh
- lem
- leu
- lew
- lex
- lez
- lfn
- lgg
- lgl
- lgm
- lhi
- lhu
- lia
- lid
- lif
- lij
- lim
- lin
- lip
- lis
- lit
- liv
- ljp
- lki
- llb
- lld
- llg
- lln
- lmk
- lmo
- lmp
- lnd
- lob
- loe
- log
- lok
- lol
- lom
- loq
- loz
- lrc
- lsi
- lsm
- ltg
- ltz
- lua
- lub
- luc
- lud
- lue
- lug
- lun
- luo
- lus
- lvs
- lwg
- lwo
- lww
- lzh
- maa
- mad
- maf
- mag
- mah
- mai
- maj
- mak
- mal
- mam
- maq
- mar
- mas
- mau
- mav
- maw
- maz
- mbb
- mbc
- mbd
- mbf
- mbh
- mbi
- mbj
- mbl
- mbs
- mbt
- mca
- mcb
- mcd
- mcf
- mck
- mcn
- mco
- mcp
- mcq
- mcu
- mda
- mdf
- mdy
- med
- mee
- mej
- mek
- men
- meq
- mer
- met
- meu
- mev
- mfe
- mfg
- mfh
- mfi
- mfk
- mfq
- mfy
- mfz
- mgc
- mgh
- mgo
- mgr
- mhi
- mhl
- mhr
- mhw
- mhx
- mhy
- mib
- mic
- mie
- mif
- mig
- mih
- mil
- mim
- min
- mio
- mip
- miq
- mir
- mit
- miy
- miz
- mjc
- mjw
- mkd
- mkl
- mkn
- mks
- mkz
- mlh
- mlp
- mlt
- mlu
- mmn
- mmo
- mmx
- mna
- mnb
- mnf
- mni
- mnk
- mns
- mnw
- mnx
- mny
- moa
- moc
- mog
- moh
- mop
- mor
- mos
- mox
- mpg
- mph
- mpm
- mpp
- mps
- mpt
- mpx
- mqb
- mqj
- mqy
- mrg
- mri
- mrj
- mrq
- mrv
- mrw
- msb
- msc
- mse
- msk
- msy
- mta
- mtg
- mti
- mto
- mtp
- mua
- mug
- muh
- mui
- mup
- mur
- mus
- mux
- muy
- mva
- mvn
- mvp
- mwc
- mwf
- mwl
- mwm
- mwn
- mwp
- mwq
- mwv
- mww
- mxb
- mxp
- mxq
- mxt
- mxv
- mya
- myb
- myk
- myu
- myv
- myw
- myx
- myy
- mza
- mzh
- mzk
- mzl
- mzm
- mzn
- mzw
- mzz
- nab
- naf
- nah
- nak
- nap
- naq
- nas
- nav
- naw
- nba
- nbc
- nbe
- nbl
- nbq
- nbu
- nca
- nch
- ncj
- ncl
- ncq
- nct
- ncu
- ncx
- ndc
- nde
- ndh
- ndi
- ndj
- ndo
- nds
- ndz
- neb
- new
- nfa
- nfr
- ngb
- ngc
- ngl
- ngp
- ngu
- nhd
- nhe
- nhg
- nhi
- nhk
- nho
- nhr
- nhu
- nhw
- nhx
- nhy
- nia
- nif
- nii
- nij
- nim
- nin
- nio
- niu
- niy
- njb
- njm
- njn
- njo
- njz
- nkf
- nko
- nld
- nlg
- nma
- nmf
- nmh
- nmo
- nmw
- nmz
- nnb
- nng
- nnh
- nnl
- nno
- nnp
- nnq
- nnw
- noa
- nob
- nod
- nog
- non
- nop
- not
- nou
- nov
- nph
- npi
- npl
- npo
- npy
- nqo
- nre
- nrf
- nri
- nrm
- nsa
- nse
- nsm
- nsn
- nso
- nss
- nst
- nsu
- ntp
- ntr
- ntu
- nuj
- nus
- nuy
- nvm
- nwb
- nwi
- nwx
- nxd
- nya
- nyf
- nyk
- nyn
- nyo
- nyu
- nyy
- nza
- nzi
- nzm
- obo
- oci
- ogo
- ojb
- oke
- oku
- okv
- old
- olo
- omb
- omw
- ong
- ons
- ood
- opm
- orv
- ory
- oss
- ota
- otd
- ote
- otm
- otn
- oto
- otq
- ots
- otw
- oym
- ozm
- pab
- pad
- pag
- pah
- pam
- pan
- pao
- pap
- pau
- pbb
- pbc
- pbi
- pbt
- pcd
- pck
- pcm
- pdc
- pdt
- pem
- pfe
- pfl
- phm
- pib
- pio
- pir
- pis
- pjt
- pkb
- plg
- pls
- plt
- plu
- plw
- pma
- pmf
- pmq
- pms
- pmx
- pnb
- pne
- pnt
- pny
- poe
- poh
- poi
- pol
- pon
- por
- pos
- pot
- pov
- poy
- ppk
- ppo
- pps
- prf
- prg
- pri
- prq
- pse
- pss
- ptp
- ptu
- pui
- pwg
- pwn
- pww
- pxm
- qub
- quc
- quf
- qug
- quh
- qul
- qup
- qus
- quw
- quy
- quz
- qva
- qvc
- qve
- qvh
- qvi
- qvm
- qvn
- qvo
- qvs
- qvw
- qvz
- qwh
- qxh
- qxl
- qxn
- qxo
- qxr
- rad
- rai
- rap
- rar
- rav
- raw
- rcf
- rej
- rel
- rgu
- rhg
- ria
- rim
- rjs
- rkb
- rmc
- rme
- rml
- rmn
- rmo
- rmq
- rmy
- rnd
- rng
- rnl
- roh
- ron
- roo
- rop
- row
- rro
- rtm
- rub
- rue
- ruf
- rug
- run
- rup
- rus
- rwo
- sab
- sag
- sah
- san
- sas
- sat
- sba
- sbd
- sbe
- sbl
- sbs
- sby
- sck
- scn
- sco
- sda
- sdc
- sdh
- sdo
- sdq
- seh
- ses
- sey
- sfw
- sgb
- sgc
- sgh
- sgs
- sgw
- sgz
- shi
- shk
- shn
- shp
- shu
- sid
- sig
- sil
- sim
- sin
- sja
- sjo
- sju
- skg
- skr
- sld
- slk
- sll
- slv
- sma
- sme
- smj
- smk
- sml
- smn
- smo
- sms
- smt
- sna
- snc
- snd
- snf
- snn
- snp
- snw
- sny
- soe
- som
- sop
- soq
- sot
- soy
- spa
- spl
- spm
- spp
- sps
- spy
- srd
- sri
- srm
- srn
- srp
- srq
- srr
- ssd
- ssg
- ssw
- ssx
- stn
- stp
- stq
- sua
- suc
- sue
- suk
- sun
- sur
- sus
- suz
- swb
- swc
- swe
- swg
- swh
- swk
- swp
- sxb
- sxn
- syb
- syc
- syl
- szl
- szy
- tab
- tac
- tah
- taj
- tam
- tap
- taq
- tar
- tat
- tav
- taw
- tay
- tbc
- tbg
- tbk
- tbl
- tbo
- tbw
- tby
- tbz
- tca
- tcc
- tcf
- tcs
- tcy
- tcz
- ted
- tee
- tel
- tem
- teo
- ter
- tet
- tew
- tfr
- tgk
- tgo
- tgp
- tha
- thk
- thl
- tif
- tig
- tih
- tik
- tim
- tir
- tiv
- tiy
- tke
- tkl
- tkr
- tku
- tlb
- tlf
- tlh
- tlj
- tll
- tly
- tmc
- tmd
- tna
- tnc
- tnk
- tnn
- tnp
- tnr
- tob
- toc
- tod
- tog
- toh
- toi
- toj
- tok
- ton
- too
- top
- tos
- tpa
- tpi
- tpm
- tpp
- tpt
- tpw
- tpz
- tqo
- trc
- trn
- tro
- trp
- trq
- trs
- trv
- tsc
- tsg
- tsn
- tso
- tsw
- tsz
- ttc
- tte
- ttj
- ttq
- tuc
- tue
- tuf
- tui
- tuk
- tul
- tum
- tuo
- tur
- tuv
- tvk
- tvl
- twi
- twu
- twx
- txq
- txu
- tyv
- tzh
- tzj
- tzl
- tzm
- tzo
- ubr
- ubu
- udm
- udu
- uig
- ukr
- umb
- upv
- ura
- urb
- urd
- urh
- uri
- urk
- urt
- urw
- ury
- usa
- usp
- uth
- uvh
- uvl
- uzn
- uzs
- vag
- vap
- var
- vec
- ven
- vep
- vid
- vie
- viv
- vls
- vmk
- vmw
- vmy
- vol
- vot
- vro
- vun
- vut
- waj
- wal
- wap
- war
- wat
- way
- wba
- wbm
- wbp
- wed
- wer
- wes
- wew
- whg
- whk
- wib
- wim
- wiu
- wln
- wls
- wlv
- wlx
- wmt
- wmw
- wnc
- wnu
- wob
- wol
- wos
- wrk
- wrs
- wsg
- wsk
- wuu
- wuv
- wwa
- xal
- xav
- xbi
- xbr
- xed
- xho
- xla
- xmf
- xmm
- xmv
- xnn
- xog
- xon
- xrb
- xsb
- xsi
- xsm
- xsr
- xsu
- xtd
- xtm
- xtn
- xuo
- yaa
- yad
- yal
- yam
- yan
- yao
- yap
- yaq
- yat
- yaz
- ybb
- yby
- ycn
- ydd
- yim
- yka
- yle
- yli
- yml
- yom
- yon
- yor
- yrb
- yre
- yrk
- yrl
- yss
- yua
- yue
- yuj
- yup
- yut
- yuw
- yuz
- yva
- zaa
- zab
- zac
- zad
- zae
- zai
- zam
- zao
- zar
- zas
- zat
- zav
- zaw
- zca
- zdj
- zea
- zgh
- zia
- ziw
- zne
- zom
- zos
- zpa
- zpc
- zpg
- zpi
- zpj
- zpl
- zpm
- zpo
- zpq
- zpt
- zpu
- zpv
- zpz
- zsm
- zsr
- ztq
- zty
- zul
- zyb
- zyp
---
# mmBERT: A Modern Multilingual Encoder

**TL;DR:** A state-of-the-art multilingual encoder trained on 3T+ tokens across 1800+ languages, introducing novel techniques for learning low-resource languages during the decay phase of training.

mmBERT is a modern multilingual encoder that substantially outperforms previous-generation models such as XLM-R on classification, embedding, and retrieval tasks. Built on the ModernBERT architecture with novel multilingual training innovations, mmBERT demonstrates that low-resource languages can be learned effectively during the decay phase of training. It is also significantly faster than previous multilingual encoders.
## Table of Contents

- [Quick Start](#quick-start)
- [Model Description](#model-description)
- [Novel Training Innovations](#novel-training-innovations)
- [Model Family](#model-family)
- [Training Data](#training-data)
- [Model Architecture](#model-architecture)
- [Usage Examples](#usage-examples)
- [Fine-tuning Examples](#fine-tuning-examples)
- [Citation](#citation)
## Quick Start

### Installation

```bash
pip install "torch>=1.9.0"
pip install "transformers>=4.48.0"  # ModernBERT architecture support
```
### Usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
model = AutoModel.from_pretrained("jhu-clsp/mmBERT-base")

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state: [batch, seq_len, hidden]
```
## Model Description

mmBERT represents the first significant advancement over XLM-R for massively multilingual encoder models. Key features:

- **Massive Language Coverage** - Trained on over 1800 languages with a progressive inclusion strategy
- **Modern Architecture** - Built on the ModernBERT foundation with Flash Attention 2 and unpadding
- **Novel Training Recipe** - Introduces inverse mask scheduling and inverse temperature sampling
- **Open Training Data** - The complete 3T+ token dataset is publicly available
- **Decay-Phase Innovation** - Demonstrates effective learning of low-resource languages in the final training phase

The model uses bidirectional attention with a masked language modeling objective, optimized for multilingual understanding and cross-lingual transfer.
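Because the architecture supports Flash Attention 2, you can opt into it at load time. This is the standard `transformers` loading pattern rather than anything mmBERT-specific, and it assumes a CUDA GPU with the `flash-attn` package installed:

```python
import torch
from transformers import AutoModel

# Optional: load with Flash Attention 2 kernels for faster inference.
# Omit attn_implementation to fall back to the default attention.
model = AutoModel.from_pretrained(
    "jhu-clsp/mmBERT-base",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")
```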
## Novel Training Innovations

- **Progressive Language Addition:** Start with 60 high-resource languages, expand to 110 mid-resource languages, then include all 1833 languages in the decay phase.
- **Inverse Mask Schedule:** Reduce the mask ratio from 30% → 15% → 5% across the training phases for progressively refined learning.
- **Inverse Temperature Sampling:** Shift multilingual sampling from a high-resource bias (τ=0.7) toward uniform sampling (τ=0.3).
- **Model Merging:** Combine English-focused, high-resource, and all-language decay variants using TIES merging.
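A minimal sketch of the first two schedules (illustrative only: the phase boundaries, corpus sizes, and exact weighting formula below are assumptions, not the released training code):

```python
# Illustrative sketch of the mask and sampling schedules; not mmBERT's
# actual training code.

# Inverse mask schedule: the MLM mask ratio drops across training phases.
MASK_RATIO = {"pretrain": 0.30, "midtrain": 0.15, "decay": 0.05}

def language_sampling_probs(token_counts, tau):
    """Temperature sampling: p_i proportional to n_i ** tau.

    tau = 1.0 samples proportionally to corpus size (high-resource bias);
    tau -> 0 approaches uniform sampling over languages.
    """
    weights = {lang: n ** tau for lang, n in token_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Hypothetical per-language token counts.
counts = {"fra": 1_000_000_000, "swh": 50_000_000, "gug": 1_000_000}
print(language_sampling_probs(counts, tau=0.7))  # early phases: skewed
print(language_sampling_probs(counts, tau=0.3))  # decay phase: much flatter
```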
## Model Family

| Model | Total Params | Non-embed Params | Languages | Download |
|---|---|---|---|---|
| mmBERT-small | 140M | 42M | 1800+ | [jhu-clsp/mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small) |
| mmBERT-base | 307M | 110M | 1800+ | [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base) |
## Training Data

The mmBERT training data is publicly available across all phases:

| Phase | Dataset | Tokens | Description |
|---|---|---|---|
| Pre-training P1 | [mmbert-pretrain-p1-fineweb2-langs](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p1-fineweb2-langs) | 2.3T | 60 languages, foundational training |
| Pre-training P2 | [mmbert-pretrain-p2-fineweb2-remaining](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p2-fineweb2-remaining) | - | Extension data for the pre-training phase |
| Pre-training P3 | [mmbert-pretrain-p3-others](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p3-others) | - | Final pre-training data |
| Mid-training | [mmbert-midtraining](https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining) | 600B | 110 languages, context extension to 8K |
| Decay Phase | [mmbert-decay](https://huggingface.co/datasets/jhu-clsp/mmbert-decay) | 100B | 1833 languages, premium quality |

**Data sources:** filtered DCLM (English), FineWeb2 (multilingual web), FineWeb2-HQ (a filtered subset covering 20 high-resource languages), Wikipedia (MegaWika), code repositories (StarCoder, ProLong), academic papers (ArXiv, PeS2o), and community discussions (StackExchange).
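The released datasets can be streamed directly with the `datasets` library. A small sketch; the `train` split name and record layout are assumptions, so check each dataset card:

```python
from datasets import load_dataset

# Stream a few decay-phase records without downloading the full dataset.
ds = load_dataset("jhu-clsp/mmbert-decay", split="train", streaming=True)
for i, record in enumerate(ds):
    print(record)
    if i == 2:
        break
```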
## Model Architecture
| Parameter | mmBERT-small | mmBERT-base |
|---|---|---|
| Layers | 22 | 22 |
| Hidden Size | 384 | 768 |
| Intermediate Size | 1152 | 1152 |
| Attention Heads | 6 | 12 |
| Total Parameters | 140M | 307M |
| Non-embedding Parameters | 42M | 110M |
| Max Sequence Length | 8192 | 8192 |
| Vocabulary Size | 256,000 | 256,000 |
| Tokenizer | Gemma 2 | Gemma 2 |
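Because of the 256K Gemma 2 vocabulary, embeddings dominate the parameter budget. A quick arithmetic check of the table's numbers (assuming tied input/output embeddings, so the embedding matrix is counted once):

```python
# Embedding parameters = vocab_size * hidden_size.
vocab = 256_000
print(vocab * 384 / 1e6)  # ~98M  -> mmBERT-small: 140M total - 42M non-embed
print(vocab * 768 / 1e6)  # ~197M -> mmBERT-base: 307M total - 110M non-embed
```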
## Usage Examples

### Masked Language Modeling

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/mmBERT-base")

def predict_masked_token(text):
    """Return the top-5 predictions for the first mask token in `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Locate the mask positions and rank vocabulary logits there.
    mask_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)
    predictions = outputs.logits[mask_indices]
    top_tokens = torch.topk(predictions, 5, dim=-1)
    return [tokenizer.decode(token) for token in top_tokens.indices[0]]

# Works across languages
texts = [
    "The capital of France is <mask>.",
    "La capital de España es <mask>.",
    "Die Hauptstadt von Deutschland ist <mask>.",
]

for text in texts:
    predictions = predict_masked_token(text)
    print(f"Text: {text}")
    print(f"Predictions: {predictions}")
```
### Cross-lingual Embeddings

```python
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
model = AutoModel.from_pretrained("jhu-clsp/mmBERT-base")

def get_embeddings(texts):
    """Mean-pool token embeddings, ignoring padding via the attention mask."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Masked mean pooling: zero out padded positions before averaging.
    mask = inputs["attention_mask"].unsqueeze(-1)
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    embeddings = summed / mask.sum(dim=1)
    return embeddings.numpy()

multilingual_texts = [
    "Artificial intelligence is transforming technology",
    "La inteligencia artificial está transformando la tecnología",
    "L'intelligence artificielle transforme la technologie",
    "人工智能正在改变技术",
]

embeddings = get_embeddings(multilingual_texts)
similarities = cosine_similarity(embeddings)
print("Cross-lingual similarity matrix:")
print(similarities)
```
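If you prefer the `sentence-transformers` API, the checkpoint can be loaded directly. A sketch: since this is a raw encoder, the library constructs a default mean-pooling head for it, so scores may differ slightly from the masked pooling above:

```python
from sentence_transformers import SentenceTransformer

# sentence-transformers adds mean pooling automatically for plain encoders.
st_model = SentenceTransformer("jhu-clsp/mmBERT-base")
embeddings = st_model.encode(multilingual_texts)
print(embeddings.shape)
```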
## Fine-tuning Examples

### Dense Retrieval with Sentence Transformers
```python
import argparse

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.evaluation import TripletEvaluator
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=8e-5)
    parser.add_argument("--model_name", type=str, default="jhu-clsp/mmBERT-base")
    args = parser.parse_args()
    lr = args.lr
    model_name = args.model_name
    model_shortname = model_name.split("/")[-1]

    model = SentenceTransformer(model_name)

    # MS MARCO (query, positive, negative) triplets for contrastive training.
    dataset = load_dataset(
        "sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1",
        "triplet-hard",
        split="train",
    )
    dataset_dict = dataset.train_test_split(test_size=1_000, seed=12)
    train_dataset = dataset_dict["train"].select(range(1_250_000))
    eval_dataset = dataset_dict["test"]

    # Cached MNRL keeps the effective batch large with bounded GPU memory.
    loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)
    run_name = f"{model_shortname}-DPR-{lr}"

    training_args = SentenceTransformerTrainingArguments(
        output_dir=f"output/{model_shortname}/{run_name}",
        num_train_epochs=1,
        per_device_train_batch_size=512,
        per_device_eval_batch_size=512,
        warmup_ratio=0.05,
        fp16=False,
        bf16=True,
        batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid in-batch false negatives
        learning_rate=lr,
        save_strategy="steps",
        save_steps=500,
        save_total_limit=2,
        logging_steps=500,
        run_name=run_name,
    )

    dev_evaluator = TripletEvaluator(
        anchors=eval_dataset["query"],
        positives=eval_dataset["positive"],
        negatives=eval_dataset["negative"],
        name="msmarco-co-condenser-dev",
    )
    dev_evaluator(model)  # baseline score before fine-tuning

    trainer = SentenceTransformerTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss=loss,
        evaluator=dev_evaluator,
    )
    trainer.train()

    model.save_pretrained(f"output/{model_shortname}/{run_name}/final")
    model.push_to_hub(run_name, private=False)

if __name__ == "__main__":
    main()
```
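Assuming the script above is saved as `train_retriever.py` (a hypothetical filename), it can be launched with:

```bash
python train_retriever.py --lr 8e-5 --model_name jhu-clsp/mmBERT-base
```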
### Cross-lingual Classification
```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="weighted"),
    }

def main():
    model_name = "jhu-clsp/mmBERT-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=3,  # XNLI: entailment / neutral / contradiction
    )

    # Fine-tune on English NLI, then evaluate zero-shot by swapping in any
    # other XNLI language config (e.g. "fr", "sw"). Note: the "all_languages"
    # config stores per-language translation dicts rather than plain strings.
    dataset = load_dataset("xnli", "en")

    def tokenize_function(examples):
        texts = [
            f"{p} {tokenizer.sep_token} {h}"
            for p, h in zip(examples["premise"], examples["hypothesis"])
        ]
        return tokenizer(texts, truncation=True, padding=True, max_length=512)

    train_dataset = dataset["train"].map(tokenize_function, batched=True)
    eval_dataset = dataset["validation"].map(tokenize_function, batched=True)

    training_args = TrainingArguments(
        output_dir="./mmbert-xnli",
        learning_rate=3e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=3,
        weight_decay=0.01,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        greater_is_better=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )

    trainer.train()

if __name__ == "__main__":
    main()
```
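To measure cross-lingual transfer, the fine-tuned model can be evaluated on another XNLI language without further training. A sketch that reuses `tokenize_function` and `trainer` from inside `main` above:

```python
# Zero-shot evaluation on French XNLI with the English-fine-tuned model.
test_fr = load_dataset("xnli", "fr", split="test").map(tokenize_function, batched=True)
print(trainer.evaluate(test_fr))
```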
### Multilingual Reranking
```python
import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.cross_encoder import (
    CrossEncoder,
    CrossEncoderModelCardData,
    CrossEncoderTrainer,
    CrossEncoderTrainingArguments,
)
from sentence_transformers.cross_encoder.evaluation import CrossEncoderNanoBEIREvaluator
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss
from sentence_transformers.util import mine_hard_negatives

def main():
    model_name = "jhu-clsp/mmBERT-base"
    train_batch_size = 32
    num_epochs = 2
    num_hard_negatives = 7

    model = CrossEncoder(
        model_name,
        model_card_data=CrossEncoderModelCardData(
            language="multilingual",
            license="mit",
        ),
    )

    full_dataset = load_dataset("sentence-transformers/gooaq", split="train").select(range(50_000))
    dataset_dict = full_dataset.train_test_split(test_size=1_000, seed=42)
    train_dataset = dataset_dict["train"]
    eval_dataset = dataset_dict["test"]

    # Mine hard negatives with a lightweight bi-encoder to build labeled pairs.
    embedding_model = SentenceTransformer(
        "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2", device="cpu"
    )
    hard_train_dataset = mine_hard_negatives(
        train_dataset,
        embedding_model,
        num_negatives=num_hard_negatives,
        margin=0,
        range_min=0,
        range_max=100,
        sampling_strategy="top",
        batch_size=2048,
        output_format="labeled-pair",
        use_faiss=True,
    )

    # Weight positives to offset the 7:1 negative-to-positive ratio.
    loss = BinaryCrossEntropyLoss(model=model, pos_weight=torch.tensor(num_hard_negatives))

    nano_beir_evaluator = CrossEncoderNanoBEIREvaluator(
        dataset_names=["msmarco", "nfcorpus", "nq"],
        batch_size=train_batch_size,
    )

    args = CrossEncoderTrainingArguments(
        output_dir="./mmbert-reranker",
        num_train_epochs=num_epochs,
        per_device_train_batch_size=train_batch_size,
        per_device_eval_batch_size=train_batch_size,
        learning_rate=2e-5,
        warmup_ratio=0.1,
        fp16=False,
        bf16=True,
        dataloader_num_workers=4,
        load_best_model_at_end=True,
        metric_for_best_model="eval_msmarco_ndcg@10",
        eval_strategy="steps",
        eval_steps=1000,
        save_strategy="steps",
        save_steps=1000,
        save_total_limit=2,
        logging_steps=200,
        seed=42,
    )

    trainer = CrossEncoderTrainer(
        model=model,
        args=args,
        train_dataset=hard_train_dataset,
        loss=loss,
        evaluator=nano_beir_evaluator,
    )
    trainer.train()

    model.save_pretrained("./mmbert-reranker/final")

if __name__ == "__main__":
    main()
```
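After training, the saved cross-encoder scores query-document pairs directly:

```python
from sentence_transformers.cross_encoder import CrossEncoder

reranker = CrossEncoder("./mmbert-reranker/final")
scores = reranker.predict([
    ("what is mmbert", "mmBERT is a multilingual encoder model."),
    ("what is mmbert", "The weather is nice today."),
])
print(scores)  # higher score = more relevant
```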
## Citation

If you use mmBERT in your research, please cite our work:

```bibtex
@misc{marone2025mmbertmodernmultilingualencoder,
      title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
      author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
      year={2025},
      eprint={2509.06888},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.06888},
}
```