codefuse-ai/F2LLM-1.7B

Geralt-Targaryen committed on Sep 29

Commit 93701ce (verified) · 1 Parent(s): 52edc3d

Update README.md

Files changed (1)

README.md +32 -2


@@ -8,9 +8,39 @@ base_model:
 - Qwen/Qwen3-1.7B
 ---
-F2LLM (Foundation to Feature Large Language Models) are foundation models directly finetuned on 6 million high-quality query-document pairs (available in [codefuse-ai/F2LLM](https://huggingface.co/datasets/codefuse-ai/F2LLM)) covering a diverse range of retrieval, classification, and clustering data, curated solely from open-source datasets without any synthetic data. These models are trained with homogeneous macro batches in a single stage, without sophisticated multi-stage pipelines.
-To evaluate F2LLMs on MTEB:
+F2LLMs (Foundation to Feature Large Language Models) are foundation models directly finetuned on 6 million high-quality query-document pairs (available in [codefuse-ai/F2LLM](https://huggingface.co/datasets/codefuse-ai/F2LLM)) covering a diverse range of retrieval, classification, and clustering data, curated solely from open-source datasets without any synthetic data. These models are trained with homogeneous macro batches in a single stage, without sophisticated multi-stage pipelines.
+To encode a batch of sentences:
+```python
+from transformers import AutoModel, AutoTokenizer
+import torch
+import torch.nn.functional as F
+model_path = "codefuse-ai/F2LLM-1.7B"
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map={'': 0})
+sentences = [
+    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
+    'Model checkpoints, training datasets, and training code are released, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future research in text embedding models.'
+]
+def encode(sentences):
+    batch_size = len(sentences)
+    sentences = [s+tokenizer.eos_token for s in sentences]
+    tokenized_inputs = tokenizer(sentences, padding=True, return_tensors='pt', add_special_tokens=False).to(model.device)
+    last_hidden_state = model(**tokenized_inputs).last_hidden_state
+    eos_positions = tokenized_inputs.attention_mask.sum(dim=1) - 1
+    embeddings = last_hidden_state[torch.arange(batch_size, device=model.device), eos_positions]
+    embeddings = F.normalize(embeddings, p=2, dim=1)
+    return embeddings
+embeddings = encode(sentences)
+```
+To evaluate F2LLMs on MTEB (currently requires installing MTEB from source):
 ```python
 import mteb