F2LLMs (Foundation to Feature Large Language Models) are foundation models directly finetuned on 6 million high-quality query-document pairs (available in [codefuse-ai/F2LLM](https://huggingface.co/datasets/codefuse-ai/F2LLM)) covering a diverse range of retrieval, classification, and clustering data, curated solely from open-source datasets without any synthetic data. These models are trained with homogeneous macro batches in a single stage, without sophisticated multi-stage pipelines.

## Usage

To encode a batch of sentences:

```python
# NOTE: the body of encode() below is a minimal sketch (last-token pooling over the
# final hidden states followed by L2 normalization); adapt it to your own setup as needed.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "codefuse-ai/F2LLM-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="right")
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16).cuda().eval()

@torch.no_grad()
def encode(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt").to(model.device)
    hidden = model(**batch).last_hidden_state
    # embedding of each sentence = hidden state of its last non-padding token
    last_token_idx = batch["attention_mask"].sum(dim=1) - 1
    embeddings = hidden[torch.arange(hidden.size(0), device=hidden.device), last_token_idx]
    return F.normalize(embeddings, p=2, dim=1)

sentences = [
    "What is the capital of France?",
    "Paris is the capital of France.",
]
embeddings = encode(sentences)
```
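
Since the embeddings in the sketch above are L2-normalized, cosine similarity between sentences reduces to a plain dot product, for example:

```python
# cosine-similarity matrix between all encoded sentences (embeddings are already normalized)
scores = embeddings @ embeddings.T
print(scores)
```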

## Evaluation

To evaluate F2LLMs on MTEB (currently requires installing MTEB from source):

```python
import mteb

# NOTE: the task list below is only a placeholder; substitute whichever MTEB tasks
# you want to evaluate on.
tasks = mteb.get_tasks(tasks=["STS12", "Banking77Classification"])

model = mteb.get_model("codefuse-ai/F2LLM-1.7B", device="cuda:0")
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, encode_kwargs={"batch_size": 16})
```
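
Installing MTEB from source usually amounts to `pip install git+https://github.com/embeddings-benchmark/mteb.git` (or cloning the repository and running `pip install -e .`). If GPU memory is tight, lower the `batch_size` passed through `encode_kwargs`; it only affects evaluation speed and memory use.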

## Training

Training code is available in our [GitHub repo](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/F2LLM).

## Citation

If you use the F2LLM models, data, or code, please cite the following technical report.

```bibtex
@article{2025F2LLM,
  title   = {F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data},
  author  = {Ziyin Zhang and Zihan Liao and Hang Yu and Peng Di and Rui Wang},
  journal = {CoRR},
  volume  = {abs/2510.02294},
  year    = {2025},
  url     = {https://doi.org/10.48550/arXiv.2510.02294},
  doi     = {10.48550/ARXIV.2510.02294},
  eprinttype = {arXiv},
  eprint  = {2510.02294}
}
```