Update README.md
README.md CHANGED
@@ -8,9 +8,39 @@ base_model:
 - Qwen/Qwen3-1.7B
 ---

-
+F2LLMs (Foundation to Feature Large Language Models) are foundation models directly finetuned on 6 million high-quality query-document pairs (available in [codefuse-ai/F2LLM](https://huggingface.co/datasets/codefuse-ai/F2LLM)) covering a diverse range of retrieval, classification, and clustering data, curated solely from open-source datasets without any synthetic data. These models are trained with homogeneous macro batches in a single stage, without sophisticated multi-stage pipelines.

-To
+To encode a batch of sentences:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+import torch
+import torch.nn.functional as F
+
+
+model_path = "codefuse-ai/F2LLM-1.7B"
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map={'': 0})
+
+sentences = [
+    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
+    'Model checkpoints, training datasets, and training code are released, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future research in text embedding models.'
+]
+
+def encode(sentences):
+    batch_size = len(sentences)
+    sentences = [s+tokenizer.eos_token for s in sentences]
+    tokenized_inputs = tokenizer(sentences, padding=True, return_tensors='pt', add_special_tokens=False).to(model.device)
+    last_hidden_state = model(**tokenized_inputs).last_hidden_state
+    eos_positions = tokenized_inputs.attention_mask.sum(dim=1) - 1
+    embeddings = last_hidden_state[torch.arange(batch_size, device=model.device), eos_positions]
+    embeddings = F.normalize(embeddings, p=2, dim=1)
+    return embeddings
+
+embeddings = encode(sentences)
+```
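The `encode` function above takes each sentence's embedding from the hidden state at its final (EOS) token and L2-normalizes it, so the dot product of two embeddings equals their cosine similarity. The following is a minimal, dependency-free sketch of that pooling step on toy data. The helper names `last_token_pool` and `cosine` and all numbers are illustrative assumptions, not part of the F2LLM code, and the sketch assumes right padding, as the `attention_mask.sum(dim=1) - 1` index does.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Pick each sequence's last attended position and L2-normalize it.

    hidden_states: [batch][seq_len][dim] nested lists of floats
    attention_mask: [batch][seq_len] nested lists of 0/1, right-padded (assumption)
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        last = sum(mask) - 1  # index of the final non-padding token
        vec = states[last]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against a zero vector
        pooled.append([x / norm for x in vec])
    return pooled


def cosine(u, v):
    # For unit-length vectors the dot product is the cosine similarity.
    return sum(a * b for a, b in zip(u, v))


# Toy batch: 2 sequences, seq_len 3, dim 2; the second sequence has one padding slot.
hidden = [
    [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]],
    [[0.5, 0.5], [0.0, 2.0], [9.0, 9.0]],
]
mask = [[1, 1, 1], [1, 1, 0]]

emb = last_token_pool(hidden, mask)
similarity = cosine(emb[0], emb[1])
```

With the real model, the same idea suggests that `embeddings @ embeddings.T` on the output of `encode` would give the pairwise cosine similarities of the batch, since the rows are already unit-normalized.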
+
+To evaluate F2LLMs on MTEB (currently requires installing MTEB from source):

 ```python
 import mteb