Tokenizer Study
Models for comparing how tokenizer properties affect pre-training compression, and how that compression relates to downstream performance.
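The description does not say which compression metric the study uses. As a rough illustration only, one common proxy is UTF-8 bytes per token: the higher the value, the more text each token covers. The sketch below assumes that metric; `bytes_per_token` and its `tokenize` argument are hypothetical names introduced here, not part of the collection.

```python
from typing import Callable, Iterable, List


def bytes_per_token(texts: Iterable[str], tokenize: Callable[[str], List]) -> float:
    """Average number of UTF-8 bytes covered by one token (assumed metric)."""
    total_bytes = 0
    total_tokens = 0
    for text in texts:
        total_bytes += len(text.encode("utf-8"))  # size of the raw text in bytes
        total_tokens += len(tokenize(text))       # tokens produced for that text
    if total_tokens == 0:
        raise ValueError("the tokenizer produced no tokens")
    return total_bytes / total_tokens


# Whitespace splitting stands in for a real subword tokenizer here.
ratio = bytes_per_token(["ab cd"], str.split)  # 5 bytes over 2 tokens
```

With the `transformers` library, the `tokenize` method of a tokenizer loaded through `AutoTokenizer.from_pretrained` could be passed in place of `str.split`, assuming the repository in question ships tokenizer files (not verified here).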
shikhar-srivastava/llama-130m-prenorm-train_c4_2B-tok_llama2
0.1B • Updated • 21
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_bpe_unscaled_8192
97.5M • Updated • 6
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_unigram_unscaled_8192
97.5M • Updated
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_bpe_unscaled_8192
97.5M • Updated
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_unigram_unscaled_8192
97.5M • Updated
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_urd_arab_bpe_unscaled_8192
97.5M • Updated
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_bpe_unscaled_8192
97.5M • Updated
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_amh_ethi_bpe_unscaled_8192
97.5M • Updated
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_bpe_unscaled_16384
0.1B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 16384
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_bpe_unscaled_32768
0.1B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 32768
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_bpe_unscaled_49152
0.2B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 49152
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_bpe_unscaled_65536
0.2B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 65536
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_bpe_unscaled_81920
0.2B • Updated • 6
Note: Model with unscaled script, bpe tokenizer, vocab size 81920
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_bpe_unscaled_98304
0.2B • Updated • 4
Note: Model with unscaled script, bpe tokenizer, vocab size 98304
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_urd_arab_bpe_unscaled_16384
0.1B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 16384
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_urd_arab_bpe_unscaled_32768
0.1B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 32768
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_urd_arab_bpe_unscaled_49152
0.2B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 49152
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_urd_arab_bpe_unscaled_65536
0.2B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 65536
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_urd_arab_bpe_unscaled_81920
0.2B • Updated • 10
Note: Model with unscaled script, bpe tokenizer, vocab size 81920
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_urd_arab_bpe_unscaled_98304
0.2B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 98304
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_amh_ethi_bpe_unscaled_16384
0.1B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 16384
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_amh_ethi_bpe_unscaled_32768
0.1B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 32768
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_amh_ethi_bpe_unscaled_49152
0.2B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 49152
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_amh_ethi_bpe_unscaled_65536
0.2B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 65536
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_amh_ethi_bpe_unscaled_81920
0.2B • Updated • 17
Note: Model with unscaled script, bpe tokenizer, vocab size 81920
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_amh_ethi_bpe_unscaled_98304
0.2B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 98304
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_bpe_unscaled_16384
0.1B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 16384
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_bpe_unscaled_32768
0.1B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 32768
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_bpe_unscaled_49152
0.2B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 49152
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_bpe_unscaled_65536
0.2B • Updated • 4
Note: Model with unscaled script, bpe tokenizer, vocab size 65536
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_bpe_unscaled_81920
0.2B • Updated • 1
Note: Model with unscaled script, bpe tokenizer, vocab size 81920
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_bpe_unscaled_16384
0.1B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 16384
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_bpe_unscaled_32768
0.1B • Updated • 1
Note: Model with unscaled script, bpe tokenizer, vocab size 32768
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_bpe_unscaled_49152
0.2B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 49152
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_bpe_unscaled_65536
0.2B • Updated • 17
Note: Model with unscaled script, bpe tokenizer, vocab size 65536
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_bpe_unscaled_81920
0.2B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 81920
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_bpe_unscaled_98304
0.2B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 98304
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_bpe_unscaled_114688
0.3B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 114688
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_bpe_unscaled_262144
Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 262144
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_bpe_unscaled_114688
0.3B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 114688
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_urd_arab_bpe_unscaled_114688
0.3B • Updated • 5
Note: Model with unscaled script, bpe tokenizer, vocab size 114688
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_urd_arab_bpe_unscaled_262144
0.5B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 262144
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_amh_ethi_bpe_unscaled_114688
0.3B • Updated • 6
Note: Model with unscaled script, bpe tokenizer, vocab size 114688
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_unigram_unscaled_16384
0.1B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 16384
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_unigram_unscaled_32768
0.1B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 32768
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_unigram_unscaled_49152
0.2B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 49152
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_unigram_unscaled_65536
0.2B • Updated • 6
Note: Model with unscaled script, unigram tokenizer, vocab size 65536
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_unigram_unscaled_81920
0.2B • Updated • 3
Note: Model with unscaled script, unigram tokenizer, vocab size 81920
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_unigram_unscaled_98304
0.2B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 98304
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_unigram_unscaled_114688
0.3B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 114688
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_unigram_unscaled_16384
0.1B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 16384
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_unigram_unscaled_32768
0.1B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 32768
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_unigram_unscaled_49152
0.2B • Updated • 6
Note: Model with unscaled script, unigram tokenizer, vocab size 49152
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_unigram_unscaled_65536
0.2B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 65536
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_unigram_unscaled_81920
0.2B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 81920
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_unigram_unscaled_98304
0.2B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 98304
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_eng_latn_unigram_unscaled_114688
0.3B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 114688
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_tha_thai_bpe_unscaled_262144
0.5B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 262144
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_urd_arab_unigram_unscaled_8192
97.5M • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 8192
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_urd_arab_unigram_unscaled_16384
0.1B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 16384
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_urd_arab_unigram_unscaled_32768
0.1B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 32768
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_urd_arab_unigram_unscaled_49152
0.2B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 49152
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_urd_arab_unigram_unscaled_65536
0.2B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 65536
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_amh_ethi_unigram_unscaled_8192
97.5M • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 8192
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_amh_ethi_unigram_unscaled_16384
0.1B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 16384
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_amh_ethi_unigram_unscaled_32768
0.1B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 32768
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_amh_ethi_unigram_unscaled_49152
0.2B • Updated • 6
Note: Model with unscaled script, unigram tokenizer, vocab size 49152
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_amh_ethi_unigram_unscaled_65536
0.2B • Updated • 18
Note: Model with unscaled script, unigram tokenizer, vocab size 65536
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_amh_ethi_unigram_unscaled_81920
0.2B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 81920
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_amh_ethi_bpe_unscaled_262144
0.5B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 262144
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_unigram_unscaled_8192
97.5M • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 8192
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_unigram_unscaled_16384
0.1B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 16384
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_unigram_unscaled_32768
0.1B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 32768
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_unigram_unscaled_49152
0.2B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 49152
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_unigram_unscaled_65536
0.2B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 65536
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_unigram_unscaled_81920
0.2B • Updated • 3
Note: Model with unscaled script, unigram tokenizer, vocab size 81920
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_bpe_unscaled_98304
0.2B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 98304
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_unigram_unscaled_98304
0.2B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 98304
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_bpe_unscaled_114688
0.3B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 114688
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_unigram_unscaled_114688
0.3B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 114688
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_vie_latn_bpe_unscaled_262144
0.5B • Updated
Note: Model with unscaled script, bpe tokenizer, vocab size 262144
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_urd_arab_unigram_unscaled_81920
0.2B • Updated • 3
Note: Model with unscaled script, unigram tokenizer, vocab size 81920
shikhar-srivastava/mono_gold_130m_pre_lr1e-4_amh_ethi_unigram_unscaled_98304
0.2B • Updated
Note: Model with unscaled script, unigram tokenizer, vocab size 98304
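The `mono_gold_...` repository names above appear to follow a fixed pattern: language code, script code, tokenizer type, the `unscaled` flag, then vocabulary size. A minimal sketch for splitting a name into those fields, for example to group models by vocabulary size, might look like the following. The pattern is inferred from the names in this listing; `ModelSpec` and `parse_model_id` are hypothetical helpers, not part of the collection.

```python
import re
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    language: str    # three-letter code such as "tha" or "eng"
    script: str      # four-letter code such as "thai" or "latn"
    tokenizer: str   # "bpe" or "unigram"
    scaling: str     # "unscaled" for every model listed here
    vocab_size: int


# Pattern inferred from the listed names; an assumption, not a documented scheme.
_PATTERN = re.compile(
    r"mono_gold_130m_pre_lr1e-4_"
    r"(?P<language>[a-z]{3})_(?P<script>[a-z]{4})_"
    r"(?P<tokenizer>bpe|unigram)_(?P<scaling>[a-z]+)_(?P<vocab_size>\d+)$"
)


def parse_model_id(model_id: str) -> Optional[ModelSpec]:
    """Return the fields encoded in a repo name, or None if it does not match."""
    name = model_id.split("/")[-1]  # drop the "shikhar-srivastava/" owner prefix
    match = _PATTERN.match(name)
    if match is None:
        return None
    return ModelSpec(
        match["language"],
        match["script"],
        match["tokenizer"],
        match["scaling"],
        int(match["vocab_size"]),
    )
```

Names that do not fit the pattern, such as the `llama-130m-prenorm-train_c4_2B-tok_llama2` entry, return `None` rather than raising.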