Add new SentenceTransformer model

- 1_Dense/config.json +1 -0
- 1_Dense/model.safetensors +3 -0
- README.md +238 -0
- config.json +26 -0
- config_sentence_transformers.json +49 -0
- model.safetensors +3 -0
- modules.json +14 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +31 -0
- tokenizer.json +0 -0
- tokenizer_config.json +71 -0
- vocab.txt +0 -0
1_Dense/config.json ADDED

{"in_features": 768, "out_features": 128, "bias": false, "activation_function": "torch.nn.modules.linear.Identity"}
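For reference, this config describes the projection head that maps the 768-dimensional BERT token embeddings down to 128 dimensions: a bias-free linear layer with an identity activation. A minimal PyTorch sketch of the equivalent module (illustrative only; PyLate builds this layer from the config itself):

```python
import torch

# Equivalent of 1_Dense/config.json: a bias-free 768 -> 128 linear
# projection with an identity activation (i.e. no nonlinearity).
projection = torch.nn.Linear(in_features=768, out_features=128, bias=False)

token_embeddings = torch.randn(1, 300, 768)        # (batch, tokens, hidden)
colbert_embeddings = projection(token_embeddings)  # -> (1, 300, 128)
```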
1_Dense/model.safetensors ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:cb3d683617f336df4ee6033b8afa40648ee2f9030408704db63a8fe531489400
size 393304
README.md ADDED

---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
base_model: colbert-ir/colbertv2.0
pipeline_tag: sentence-similarity
library_name: PyLate
---

# PyLate model based on colbert-ir/colbertv2.0

This is a [PyLate](https://github.com/lightonai/pylate) model finetuned from [colbert-ir/colbertv2.0](https://huggingface.co/colbert-ir/colbertv2.0). It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.

## Model Details

### Model Description
- **Model Type:** PyLate model
- **Base model:** [colbert-ir/colbertv2.0](https://huggingface.co/colbert-ir/colbertv2.0) <!-- at revision c1e84128e85ef755c096a95bdb06b47793b13acf -->
- **Document Length:** 300 tokens
- **Query Length:** 32 tokens
- **Output Dimensionality:** 128 dimensions
- **Similarity Function:** MaxSim
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
- **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
- **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)

### Full Model Architecture

```
ColBERT(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
```
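Unlike single-vector models, ColBERT scores a query/document pair by late interaction: each query token embedding is matched against its most similar document token embedding, and the per-token maxima are summed. A minimal sketch of MaxSim scoring (illustrative only; PyLate computes this internally):

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim late interaction: for each query token, take the maximum
    similarity over all document tokens, then sum over query tokens.

    query_emb: (num_query_tokens, 128), doc_emb: (num_doc_tokens, 128),
    both assumed L2-normalized so the dot product is a cosine similarity.
    """
    similarity = query_emb @ doc_emb.T           # (num_query_tokens, num_doc_tokens)
    return similarity.max(dim=1).values.sum()    # scalar relevance score

# Toy example with random, normalized embeddings
q = torch.nn.functional.normalize(torch.randn(32, 128), dim=1)
d = torch.nn.functional.normalize(torch.randn(300, 128), dim=1)
print(maxsim_score(q, d))
```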

## Usage
First install the PyLate library:

```bash
pip install -U pylate
```

### Retrieval

PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.

#### Indexing documents

First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:

```python
from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="NohTow/colbertv2.0",
)

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can reuse the index later by loading it:

```python
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)
```

#### Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries and then retrieve the top-k documents to get the ids and relevance scores of the top matches:

```python
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
```

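For orientation, the result holds one entry per query; assuming each hit is a dict carrying the document id and its relevance score (the layout used in PyLate's examples), inspecting it looks roughly like:

```python
# Hypothetical inspection of the retrieval results: `scores` is assumed to
# contain one list of hits per query, each hit with an id and a score.
for query, hits in zip(["query for document 3", "query for document 1"], scores):
    print(query)
    for hit in hits:
        print(f"  id={hit['id']} score={hit['score']:.2f}")
```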
### Reranking
If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the rank function and pass the queries and documents to rerank:

```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="NohTow/colbertv2.0",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Framework Versions
- Python: 3.11.10
- Sentence Transformers: 3.3.1
- PyLate: 1.1.2
- Transformers: 4.46.2
- PyTorch: 2.5.1+cu124
- Accelerate: 1.1.1
- Datasets: 3.1.0
- Tokenizers: 0.20.3

## Citation

### BibTeX

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->
config.json ADDED

{
  "_name_or_path": "colbert-ir/colbertv2.0",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.46.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
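Note how this standard BERT-base config ties the pieces together: `hidden_size` of 768 is exactly the `in_features` consumed by the 1_Dense projection above, and `max_position_embeddings` of 512 is the ceiling behind the `max_seq_length` in sentence_bert_config.json below.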
config_sentence_transformers.json ADDED

{
  "__version__": {
    "sentence_transformers": "3.3.1",
    "transformers": "4.46.2",
    "pytorch": "2.5.1+cu124"
  },
  "prompts": {},
  "default_prompt_name": null,
  "similarity_fn_name": "cosine",
  "query_prefix": "[unused0]",
  "document_prefix": "[unused1]",
  "query_length": 32,
  "document_length": 300,
  "attend_to_expansion_tokens": false,
  "skiplist_words": [
    "!",
    "\"",
    "#",
    "$",
    "%",
    "&",
    "'",
    "(",
    ")",
    "*",
    "+",
    ",",
    "-",
    ".",
    "/",
    ":",
    ";",
    "<",
    "=",
    ">",
    "?",
    "@",
    "[",
    "\\",
    "]",
    "^",
    "_",
    "`",
    "{",
    "|",
    "}",
    "~"
  ]
}
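The `query_prefix`/`document_prefix` markers and the two length settings drive ColBERT's asymmetric encoding: queries are tagged with `[unused0]` and padded out to 32 tokens, documents are tagged with `[unused1]` and truncated at 300. A rough sketch of what the query side amounts to, assuming the standard ColBERT convention of padding queries with `[MASK]` expansion tokens (this is not PyLate's actual implementation):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("colbert-ir/colbertv2.0")

def expand_query(text: str, query_length: int = 32) -> list[str]:
    # Tag the text as a query with the [unused0] marker, then pad the
    # token sequence to the fixed query length with [MASK] expansion tokens.
    tokens = tokenizer.tokenize("[unused0] " + text)[:query_length]
    return tokens + ["[MASK]"] * (query_length - len(tokens))

print(expand_query("what is late interaction?"))
```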
model.safetensors ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:fc984a3dfbe2a0d8939e0ee4db45aa071da2d9e9ef9817a86e52f5f55a274305
size 437951328
modules.json ADDED

[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Dense",
    "type": "pylate.models.Dense.Dense"
  }
]
sentence_bert_config.json ADDED

{
  "max_seq_length": 512,
  "do_lower_case": false
}
special_tokens_map.json ADDED

{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "[MASK]",
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED

(The diff for this file is too large to render. See raw diff.)
tokenizer_config.json ADDED

{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "[unused0]",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "2": {
      "content": "[unused1]",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[MASK]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
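Note that `pad_token` is set to `[MASK]` rather than `[PAD]` in both special_tokens_map.json and this file; this is deliberate in ColBERT, since query padding doubles as learned query-expansion slots. A quick way to confirm (assuming the repo id `NohTow/colbertv2.0` from the usage examples above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NohTow/colbertv2.0")
print(tokenizer.pad_token)         # "[MASK]": padding doubles as query expansion
print(tokenizer.model_max_length)  # 512, the BERT position-embedding limit
```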
vocab.txt ADDED

(The diff for this file is too large to render. See raw diff.)