Geralt-Targaryen committed
Commit fa6be4b · verified · 1 Parent(s): 93701ce

Update README.md

Files changed (1): README.md (+26, -0)
README.md CHANGED
@@ -10,6 +10,8 @@ base_model:
 
 F2LLMs (Foundation to Feature Large Language Models) are foundation models directly finetuned on 6 million high-quality query-document pairs (available in [codefuse-ai/F2LLM](https://huggingface.co/datasets/codefuse-ai/F2LLM)) covering a diverse range of retrieval, classification, and clustering data, curated solely from open-source datasets without any synthetic data. These models are trained with homogeneous macro batches in a single stage, without sophisticated multi-stage pipelines.
 
+## Usage
+
 To encode a batch of sentences:
 
 ```python
@@ -40,6 +42,8 @@ def encode(sentences):
 embeddings = encode(sentences)
 ```
 
+## Evaluation
+
 To evaluate F2LLMs on MTEB (currently requires installing MTEB from source):
 
 ```python
@@ -59,3 +63,25 @@ model = mteb.get_model("codefuse-ai/F2LLM-1.7B", device="cuda:0")
 evaluation = mteb.MTEB(tasks=tasks)
 evaluation.run(model, encode_kwargs={"batch_size": 16})
 ```
+
+## Training
+
+Training code is available in our [GitHub repo](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/F2LLM).
+
+## Citation
+
+If you use the F2LLM models, data, or code, please cite the following technical report.
+
+```
+@article{2025F2LLM,
+  title      = {F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data},
+  author     = {Ziyin Zhang and Zihan Liao and Hang Yu and Peng Di and Rui Wang},
+  journal    = {CoRR},
+  volume     = {abs/2510.02294},
+  year       = {2025},
+  url        = {https://doi.org/10.48550/arXiv.2510.02294},
+  doi        = {10.48550/ARXIV.2510.02294},
+  eprinttype = {arXiv},
+  eprint     = {2510.02294}
+}
+```
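
The body of the README's `encode` function is collapsed in the diff above (old lines 16–39), so only `embeddings = encode(sentences)` is visible. As a rough sketch only: the following assumes last-token pooling over the final hidden states, L2-normalized outputs, and a 512-token limit, all of which are common choices for decoder-only embedding models but are not confirmed by this diff.

```python
# Hedged sketch -- NOT the README's elided encode(); the pooling strategy,
# max_length, and normalization here are assumptions for illustration.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "codefuse-ai/F2LLM-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda:0")
model.eval()

@torch.no_grad()
def encode(sentences):
    batch = tokenizer(
        sentences, padding=True, truncation=True,
        max_length=512, return_tensors="pt",
    ).to(model.device)
    hidden = model(**batch).last_hidden_state          # (batch, seq, dim)
    # Last-token pooling: take each sequence's final non-padding position
    # (assumes right padding, the tokenizer default).
    last = batch["attention_mask"].sum(dim=1) - 1      # (batch,)
    pooled = hidden[torch.arange(hidden.size(0), device=hidden.device), last]
    return F.normalize(pooled, p=2, dim=1)

embeddings = encode(["a query", "a document"])
```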
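
The start of the MTEB snippet is likewise collapsed (new lines 50–62), so the `mteb` import and the `tasks` definition are not visible in this diff. A plausible completion is below; the task list is an arbitrary placeholder, not the README's actual selection.

```python
# Plausible completion of the collapsed MTEB snippet; the task list is a
# placeholder (the README's actual task selection is elided from this diff).
import mteb

tasks = mteb.get_tasks(tasks=["Banking77Classification"])  # placeholder
model = mteb.get_model("codefuse-ai/F2LLM-1.7B", device="cuda:0")

evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, encode_kwargs={"batch_size": 16})
```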