Update README.md
        README.md
    CHANGED
    
    | 
         @@ -6828,7 +6828,42 @@ but low-resource languages may see performance degradation. 
     | 
|
| 6828 | 
         | 
| 6829 | 
         
             
            ## Training Details
         
     | 
| 6830 | 
         | 
| 6831 | 
         
            -
             
     | 
| 6832 | 
         | 
| 6833 | 
         
             
            ## Benchmark Evaluation
         
     | 
| 6834 | 
         | 
| 
         | 
|
| 6828 | 
         | 
| 6829 | 
         
             
            ## Training Details
         
     | 
| 6830 | 
         | 
| 6831 | 
         
            +
            **Initialization**: [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
         
     | 
| 6832 | 
         
            +
             
     | 
| 6833 | 
         
            +
            **First stage**: contrastive pre-training with weak supervision
         
     | 
| 6834 | 
         
            +
             
     | 
| 6835 | 
         
            +
            | Dataset                                                                                                | Weak supervision                      | # of text pairs |
         
     | 
| 6836 | 
         
            +
            |--------------------------------------------------------------------------------------------------------|---------------------------------------|-----------------|
         
     | 
| 6837 | 
         
            +
            | Filtered [mC4](https://huggingface.co/datasets/mc4)                                                    | (title, page content)                 | 1B              |
         
     | 
| 6838 | 
         
            +
            | [CC News](https://huggingface.co/datasets/intfloat/multilingual_cc_news)                               | (title, news content)                 | 400M            |
         
     | 
| 6839 | 
         
            +
            | [NLLB](https://huggingface.co/datasets/allenai/nllb)                                                   | translation pairs                     | 2.4B            |
         
     | 
| 6840 | 
         
            +
            | [Wikipedia](https://huggingface.co/datasets/intfloat/wikipedia)                                        | (hierarchical section title, passage) | 150M            |
         
     | 
| 6841 | 
         
            +
            | Filtered [Reddit](https://www.reddit.com/)                                                             | (comment, response)                   | 800M            |
         
     | 
| 6842 | 
         
            +
            | [S2ORC](https://github.com/allenai/s2orc)                                                              | (title, abstract) and citation pairs  | 100M            |
         
     | 
| 6843 | 
         
            +
            | [Stackexchange](https://stackexchange.com/)                                                            | (question, answer)                    | 50M             |
         
     | 
| 6844 | 
         
            +
            | [xP3](https://huggingface.co/datasets/bigscience/xP3)                                                  | (input prompt, response)              | 80M             |
         
     | 
| 6845 | 
         
            +
            | [Miscellaneous unsupervised SBERT data](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | -                                     | 10M             |
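
As a rough illustration of what contrastive training on the text pairs above computes, the sketch below implements an InfoNCE-style loss with in-batch negatives in plain Python. The README only says "contrastive pre-training"; the exact loss form, the use of cosine similarity, the `temperature` value, and the helper names `cosine` and `info_nce_loss` are assumptions made for illustration, not confirmed details of the authors' setup. Refer to the linked paper for the actual recipe.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE-style loss over a batch of (query, passage) embeddings.

    passages[i] is treated as the positive for queries[i]; every other
    passage in the batch acts as a negative. The temperature default is
    an illustrative assumption, not a value taken from the README.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in passages]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true positive.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy 2-d embeddings: each query points roughly at its own passage.
queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[0.9, 0.1], [0.1, 0.9]]

aligned = info_nce_loss(queries, passages)
shuffled = info_nce_loss(queries, passages[::-1])
print(aligned < shuffled)
```

With correctly paired embeddings the loss is close to zero, while swapping the passages makes each query's "positive" the less similar candidate and the loss grows. In a weakly supervised setting, a pair such as (title, page content) would play the role of (query, passage).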
         
     | 
| 6846 | 
         
            +
             
     | 
| 6847 | 
         
            +
            **Second stage**: supervised fine-tuning
         
     | 
| 6848 | 
         
            +
             
     | 
| 6849 | 
         
            +
            | Dataset                                                                                | Language     | # of text pairs |
         
     | 
| 6850 | 
         
            +
            |----------------------------------------------------------------------------------------|--------------|-----------------|
         
     | 
| 6851 | 
         
            +
            | [MS MARCO](https://microsoft.github.io/msmarco/)                                       | English      | 500k            |
         
     | 
| 6852 | 
         
            +
            | [NQ](https://github.com/facebookresearch/DPR)                                          | English      | 70k             |
         
     | 
| 6853 | 
         
            +
            | [Trivia QA](https://github.com/facebookresearch/DPR)                                   | English      | 60k             |
         
     | 
| 6854 | 
         
            +
            | [NLI from SimCSE](https://github.com/princeton-nlp/SimCSE)                             | English      | <300k           |
         
     | 
| 6855 | 
         
            +
            | [ELI5](https://huggingface.co/datasets/eli5)                                           | English      | 500k            |
         
     | 
| 6856 | 
         
            +
            | [DuReader Retrieval](https://github.com/baidu/DuReader/tree/master/DuReader-Retrieval) | Chinese      | 86k             |
         
     | 
| 6857 | 
         
            +
            | [KILT Fever](https://huggingface.co/datasets/kilt_tasks)                               | English      | 70k             |
         
     | 
| 6858 | 
         
            +
            | [KILT HotpotQA](https://huggingface.co/datasets/kilt_tasks)                            | English      | 70k             |
         
     | 
| 6859 | 
         
            +
            | [SQuAD](https://huggingface.co/datasets/squad)                                         | English      | 87k             |
         
     | 
| 6860 | 
         
            +
            | [Quora](https://huggingface.co/datasets/quora)                                         | English      | 150k            |
         
     | 
| 6861 | 
         
            +
            | [Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi)                          | 11 languages | 50k             |
         
     | 
| 6862 | 
         
            +
            | [MIRACL](https://huggingface.co/datasets/miracl/miracl)                                | 16 languages | 40k             |
         
     | 
| 6863 | 
         
            +
             
     | 
| 6864 | 
         
            +
            For all labeled datasets, we use only their training sets for fine-tuning.
         
     | 
| 6865 | 
         
            +
             
     | 
| 6866 | 
         
            +
            For other training details, please refer to our paper at [https://arxiv.org/pdf/2212.03533.pdf](https://arxiv.org/pdf/2212.03533.pdf).
         
     | 
| 6867 | 
         | 
| 6868 | 
         
             
            ## Benchmark Evaluation
         
     | 
| 6869 | 
         |