Geralt-Targaryen committed on
Commit 93701ce · verified · 1 Parent(s): 52edc3d

Update README.md

Files changed (1)
  1. README.md +32 -2
README.md CHANGED
@@ -8,9 +8,39 @@ base_model:
 - Qwen/Qwen3-1.7B
 ---
 
-F2LLM (Foundation to Feature Large Language Models) are foundation models directly finetuned on 6 million high-quality query-document pairs (available in [codefuse-ai/F2LLM](https://huggingface.co/datasets/codefuse-ai/F2LLM)) covering a diverse range of retrieval, classification, and clustering data, curated solely from open-source datasets without any synthetic data. These models are trained with homogeneous macro batches in a single stage, without sophisticated multi-stage pipelines.
+F2LLMs (Foundation to Feature Large Language Models) are foundation models directly finetuned on 6 million high-quality query-document pairs (available in [codefuse-ai/F2LLM](https://huggingface.co/datasets/codefuse-ai/F2LLM)) covering a diverse range of retrieval, classification, and clustering data, curated solely from open-source datasets without any synthetic data. These models are trained with homogeneous macro batches in a single stage, without sophisticated multi-stage pipelines.
 
-To evaluate F2LLMs on MTEB:
+To encode a batch of sentences:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+import torch
+import torch.nn.functional as F
+
+
+model_path = "codefuse-ai/F2LLM-1.7B"
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map={'': 0})
+
+sentences = [
+    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
+    'Model checkpoints, training datasets, and training code are released, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future research in text embedding models.'
+]
+
+def encode(sentences):
+    batch_size = len(sentences)
+    sentences = [s+tokenizer.eos_token for s in sentences]
+    tokenized_inputs = tokenizer(sentences, padding=True, return_tensors='pt', add_special_tokens=False).to(model.device)
+    last_hidden_state = model(**tokenized_inputs).last_hidden_state
+    eos_positions = tokenized_inputs.attention_mask.sum(dim=1) - 1
+    embeddings = last_hidden_state[torch.arange(batch_size, device=model.device), eos_positions]
+    embeddings = F.normalize(embeddings, p=2, dim=1)
+    return embeddings
+
+embeddings = encode(sentences)
+```
+
+To evaluate F2LLMs on MTEB (currently requires installing MTEB from source):
 
 ```python
 import mteb
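
For readers trying the snippet added in this commit, here is a minimal usage sketch (illustrative only, not part of the commit) that assumes the `encode()` helper and `sentences` list defined above are in scope. Because the embeddings are already L2-normalized, pairwise cosine similarity reduces to a matrix product.

```python
# Illustrative follow-up, not part of this commit: assumes encode() and
# sentences from the snippet above. encode() returns L2-normalized vectors,
# so the dot product of two embeddings is their cosine similarity.
embeddings = encode(sentences)            # shape: (len(sentences), hidden_size)
similarity = embeddings @ embeddings.T    # pairwise cosine similarity matrix
print(similarity)
```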