Update README.md
README.md CHANGED
@@ -60,11 +60,12 @@ from transformers import AutoModelForMaskedLM, AutoTokenizer
 
 
 # get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
-def get_sparse_vector(feature, output):
+def get_sparse_vector(feature, output, prune_ratio=0.1):
     values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
     values = torch.log(1 + torch.relu(values))
     values[:,special_token_ids] = 0
-
+    max_values = values.max(dim=-1)[0].unsqueeze(1) * prune_ratio
+    return values * (values > max_values)
 
 # transform the sparse vector to a dict of (token, weight)
 def transform_sparse_vector_to_dict(sparse_vector):
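For intuition, here is a standalone toy run of the pruning step this commit adds (the numbers are made up, not model output): any activation below `prune_ratio` times the row's maximum is zeroed, which keeps the resulting sparse vectors compact.

```python
import torch

# hypothetical activations for one sequence over a 5-token vocab
values = torch.tensor([[2.0, 0.15, 0.0, 1.2, 0.05]])

prune_ratio = 0.1
# per-row threshold: 10% of the row's max activation (2.0 * 0.1 = 0.2)
max_values = values.max(dim=-1)[0].unsqueeze(1) * prune_ratio
pruned = values * (values > max_values)

print(pruned)  # tensor([[2.0000, 0.0000, 0.0000, 1.2000, 0.0000]])
```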
@@ -127,7 +128,7 @@ document_sparse_vector = get_sparse_vector(feature_document, output)
 
 # get similarity score
 sim_score = torch.matmul(query_sparse_vector[0],document_sparse_vector[0])
-print(sim_score) # tensor(7.
+print(sim_score) # tensor(7.6317, grad_fn=<DotBackward0>)
 
 
 query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
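The diff only shows the signature of `transform_sparse_vector_to_dict`, so here is a minimal sketch of what such a helper could look like, assuming `torch` and the `tokenizer` loaded via the AutoTokenizer import visible in the first hunk; the README's real implementation may differ.

```python
# sketch only: map each nonzero vocab dimension of a sparse vector back
# to its token string, one dict of (token, weight) per input row
def transform_sparse_vector_to_dict(sparse_vector):
    all_token_weights = []
    for vec in sparse_vector:
        ids = torch.nonzero(vec, as_tuple=True)[0].tolist()
        all_token_weights.append(
            {tokenizer.convert_ids_to_tokens(i): float(vec[i]) for i in ids}
        )
    return all_token_weights
```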
@@ -143,7 +144,6 @@ for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reve
 # score in query: 1.6406, score in document: 0.9018, token: now
 # score in query: 1.6108, score in document: 0.3141, token: ?
 # score in query: 1.2721, score in document: 1.3446, token: ny
-# score in query: 0.6005, score in document: 0.1804, token: in
 ```
 
 The above code sample shows an example of neural sparse search: although there is no token overlap between the original query and the document, the model still finds a good match.
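Since the similarity is a plain dot product over the vocabulary dimension, it decomposes into per-token contributions: a token's weight in the query times its weight in the document. A toy recomputation using the weights from the printout above shows these three tokens account for roughly half the score; the remaining expansion tokens make up the rest of the 7.6317 total.

```python
# weights copied from the printed output above; tokens not shown there
# contribute the remainder of the 7.6317 total
pairs = {
    "now": (1.6406, 0.9018),
    "?":   (1.6108, 0.3141),
    "ny":  (1.2721, 1.3446),
}
partial = sum(q * d for q, d in pairs.values())
print(round(partial, 4))  # 3.6959 of the 7.6317 total
```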