Initial commit
- README.md +36 -1
- config.json +23 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tokenizer_config.json +3 -0
- vocab.txt +0 -0
    	
        README.md
    CHANGED
    
    | @@ -1,3 +1,38 @@ | |
| 1 | 
             
            ---
         | 
| 2 | 
            -
             | 
| 3 | 
             
            ---
         | 
| 1 | 
             
            ---
         | 
| 2 | 
            +
            tags:
         | 
| 3 | 
            +
            - feature-extraction
         | 
| 4 | 
            +
            pipeline_tag: feature-extraction
         | 
| 5 | 
             
            ---
         | 
| 6 | 
            +
            DRAGON+ is a BERT-base sized dense retriever initialized from [RetroMAE](https://huggingface.co/Shitao/RetroMAE) and further trained on data augmented from the MS MARCO corpus, following the approach described in [How to Train Your DRAGON:
         | 
| 7 | 
            +
            Diverse Augmentation Towards Generalizable Dense Retrieval](\url). The associated GitHub repository is available at https://github.com/facebookresearch/dpr-scale/tree/dragon. We use an asymmetric dual encoder with two distinctly parameterized encoders.
         | 
| 8 | 
            +
            The following models are also available:
         | 
| 9 | 
            +
            Model | Initialization | Query Encoder Path | Context Encoder Path
         | 
| 10 | 
            +
            ---|---|---|---
         | 
| 11 | 
            +
            DRAGON+ | Shitao/RetroMAE| facebook/dragon-plus-query-encoder | facebook/dragon-plus-context-encoder
         | 
| 12 | 
            +
             | 
| 13 | 
            +
            ## Usage (HuggingFace Transformers)
         | 
| 14 | 
            +
            The model can be used directly with HuggingFace Transformers.
         | 
| 15 | 
            +
             | 
| 16 | 
            +
            ```python
         | 
| 17 | 
            +
            import torch
         | 
| 18 | 
            +
            from transformers import AutoTokenizer, AutoModel
         | 
| 19 | 
            +
            tokenizer = AutoTokenizer.from_pretrained('facebook/dragon-plus-query-encoder')
         | 
| 20 | 
            +
            query_encoder = AutoModel.from_pretrained('facebook/dragon-plus-query-encoder')
         | 
| 21 | 
            +
            context_encoder = AutoModel.from_pretrained('facebook/dragon-plus-context-encoder')
         | 
| 22 | 
            +
             | 
| 23 | 
            +
            # We use an MS MARCO query and passages as an example
         | 
| 24 | 
            +
            query = "Where was Marie Curie born?"
         | 
| 25 | 
            +
            contexts = [
         | 
| 26 | 
            +
                "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
         | 
| 27 | 
            +
                "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
         | 
| 28 | 
            +
            ]
         | 
| 29 | 
            +
            # Apply tokenizer
         | 
| 30 | 
            +
            query_input = tokenizer(query, return_tensors='pt')
         | 
| 31 | 
            +
            ctx_input = tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')
         | 
| 32 | 
            +
            # Compute embeddings: take the last-layer hidden state of the [CLS] token
         | 
| 33 | 
            +
            query_emb = query_encoder(**query_input).last_hidden_state[:, 0, :]
         | 
| 34 | 
            +
            ctx_emb = context_encoder(**ctx_input).last_hidden_state[:, 0, :]
         | 
| 35 | 
            +
            # Compute similarity scores using dot product
         | 
| 36 | 
            +
            score1 = query_emb @ ctx_emb[0]  # 396.5625
         | 
| 37 | 
            +
            score2 = query_emb @ ctx_emb[1]  # 393.8340
         | 
| 38 | 
            +
            ```
         | 
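The README snippet scores each context with a separate dot product. As a minimal, dependency-free sketch of the same idea, the code below scores every context against the query and returns them ranked. The helper names `dot` and `rank_contexts` and the toy 3-dimensional vectors are illustrative assumptions, not part of this commit; with the tensors from the README, the same scores should come from the single matrix product `query_emb @ ctx_emb.T`.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def rank_contexts(
    query_vec: Sequence[float],
    context_vecs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (context_index, score) pairs, highest dot-product score first."""
    scores = [(i, dot(query_vec, c)) for i, c in enumerate(context_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for the 768-dimensional [CLS] embeddings (hypothetical values).
toy_query = [1.0, 0.0, 2.0]
toy_contexts = [
    [0.5, 1.0, 0.0],  # dot product with toy_query: 0.5
    [1.0, 0.0, 1.5],  # dot product with toy_query: 4.0
]

ranking = rank_contexts(toy_query, toy_contexts)
print(ranking)  # [(1, 4.0), (0, 0.5)]
```

Because the scores are raw dot products rather than cosine similarities, only their relative order within one query is meaningful, which is why the sketch returns a ranking rather than thresholding the values.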
    	
        config.json
    ADDED
    
    | @@ -0,0 +1,23 @@ | |
| 1 | 
            +
            {
         | 
| 2 | 
            +
              "architectures": [
         | 
| 3 | 
            +
                "BertForMaskedLM"
         | 
| 4 | 
            +
              ],
         | 
| 5 | 
            +
              "attention_probs_dropout_prob": 0.1,
         | 
| 6 | 
            +
              "gradient_checkpointing": false,
         | 
| 7 | 
            +
              "hidden_act": "gelu",
         | 
| 8 | 
            +
              "hidden_dropout_prob": 0.1,
         | 
| 9 | 
            +
              "hidden_size": 768,
         | 
| 10 | 
            +
              "initializer_range": 0.02,
         | 
| 11 | 
            +
              "intermediate_size": 3072,
         | 
| 12 | 
            +
              "layer_norm_eps": 1e-12,
         | 
| 13 | 
            +
              "max_position_embeddings": 512,
         | 
| 14 | 
            +
              "model_type": "bert",
         | 
| 15 | 
            +
              "num_attention_heads": 12,
         | 
| 16 | 
            +
              "num_hidden_layers": 12,
         | 
| 17 | 
            +
              "pad_token_id": 0,
         | 
| 18 | 
            +
              "position_embedding_type": "absolute",
         | 
| 19 | 
            +
              "transformers_version": "4.6.0.dev0",
         | 
| 20 | 
            +
              "type_vocab_size": 2,
         | 
| 21 | 
            +
              "use_cache": true,
         | 
| 22 | 
            +
              "vocab_size": 30522
         | 
| 23 | 
            +
            }
         | 
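A few quantities follow directly from the configuration values above. The sketch below derives them from a subset of the added `config.json`; the `summarize` helper and the choice of derived fields are illustrative assumptions, not something this commit defines.

```python
import json

# Subset of the config.json added in this commit.
CONFIG_JSON = """
{
  "hidden_size": 768,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "vocab_size": 30522
}
"""


def summarize(config: dict) -> dict:
    """Derive a few architecture figures from a BERT-style config dict."""
    hidden = config["hidden_size"]
    heads = config["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return {
        # Width of each attention head.
        "head_dim": hidden // heads,
        # Feed-forward width relative to the hidden size.
        "ffn_expansion": config["intermediate_size"] / hidden,
        # Size of the token embedding matrix (vocab_size x hidden_size).
        "token_embedding_params": config["vocab_size"] * hidden,
    }


summary = summarize(json.loads(CONFIG_JSON))
print(summary)
# {'head_dim': 64, 'ffn_expansion': 4.0, 'token_embedding_params': 23440896}
```

The `hidden_size` of 768 is also the dimensionality of the `[CLS]` embeddings that the README example compares with a dot product.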
    	
        pytorch_model.bin
    ADDED
    
    | @@ -0,0 +1,3 @@ | |
| 1 | 
            +
            version https://git-lfs.github.com/spec/v1
         | 
| 2 | 
            +
            oid sha256:c6eb0b85010b03dd634fa2a035591f3d8c5bc6ac1188c50a7e3f811526d995f7
         | 
| 3 | 
            +
            size 437995569
         | 
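The three lines added for `pytorch_model.bin` are a Git LFS pointer, not the weights themselves: they record the spec version, the SHA-256 digest of the real file, and its size in bytes. The sketch below parses such a pointer and checks a byte string against it. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and a real check would hash the downloaded file in chunks rather than holding it in memory.

```python
import hashlib

# Pointer text as added in this commit.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:c6eb0b85010b03dd634fa2a035591f3d8c5bc6ac1188c50a7e3f811526d995f7
size 437995569
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split 'key value' lines into a dict with a typed size and split oid."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """Check that data has the size and digest recorded in the pointer."""
    if pointer["algo"] != "sha256":
        raise ValueError("only sha256 pointers are handled in this sketch")
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


pointer = parse_lfs_pointer(POINTER_TEXT)
size_mib = pointer["size"] / (1024 * 1024)
print(pointer["algo"], round(size_mib, 1))  # sha256 417.7
```

At roughly 418 MiB, the recorded size explains why the file is tracked through LFS instead of being stored in the repository directly.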
    	
        special_tokens_map.json
    ADDED
    
    | @@ -0,0 +1 @@ | |
| 1 | 
            +
            {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
         | 
    	
        tokenizer_config.json
    ADDED
    
    | @@ -0,0 +1,3 @@ | |
| 1 | 
            +
            {
         | 
| 2 | 
            +
              "do_lower_case": true
         | 
| 3 | 
            +
            }
         | 
    	
        vocab.txt
    ADDED
    
    The diff for this file is too large to render.
