This model has been first pretrained on the BEIR corpus and fine-tuned on MS MARCO.

This model is trained with BERT-base as the backbone, with 110M parameters.

## Usage

Pre-trained models can be loaded through the HuggingFace Transformers library:

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("OpenMatch/cocodr-base-msmarco")
tokenizer = AutoTokenizer.from_pretrained("OpenMatch/cocodr-base-msmarco")
```

Embeddings for different sentences can then be obtained as follows:

```python
sentences = [
    "Where was Marie Curie born?",
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
# Take the embedding of the [CLS] token from the final hidden layer
embeddings = model(**inputs, output_hidden_states=True, return_dict=True).hidden_states[-1][:, 0]
```

Similarity scores between sentences are then obtained with a dot product between their embeddings:

```python
score01 = embeddings[0] @ embeddings[1]  # 216.9792
score02 = embeddings[0] @ embeddings[2]  # 216.6684
```
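
As a hypothetical illustration (not part of the original card), such dot-product scores can be used to rank candidate passages for a query: score every passage against the query embedding and sort. The small tensors below are made-up stand-ins for real model embeddings:

```python
import torch

# Made-up stand-ins for a query embedding and candidate passage embeddings
query = torch.tensor([1.0, 0.0, 2.0])
passages = torch.tensor([
    [0.5, 1.0, 0.1],  # passage 0
    [1.0, 0.0, 2.0],  # passage 1 (identical to the query)
])

scores = passages @ query                         # one dot-product score per passage
ranking = torch.argsort(scores, descending=True)  # best-scoring passage first
print(ranking.tolist())  # [1, 0]
```

With real CoCoDR embeddings, the same pattern scales to a full candidate set: embed the query and all passages, then sort by score.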