Update README.md
README.md
CHANGED
@@ -2608,28 +2608,29 @@ model-index:
 ---
 <h1 align="center">GIST Large Embedding v0</h1>
 
-*
+*GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning*
 
 The model is fine-tuned on top of the [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) using the [MEDI dataset](https://github.com/xlang-ai/instructor-embedding.git) augmented with mined triplets from the [MTEB Classification](https://huggingface.co/mteb) training dataset (excluding data from the Amazon Polarity Classification task).
 
 The model does not require any instruction for generating embeddings. This means that queries for retrieval tasks can be directly encoded without crafting instructions.
 
-Technical
+Technical paper: [GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning](https://arxiv.org/abs/2402.16829)
+
 
 # Data
 
-The dataset used is a compilation of the MEDI
+The dataset used is a compilation of the MEDI and MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset, and the specific revision used to train the model, are available:
 
 - Dataset: [avsolatorio/medi-data-mteb_avs_triplets](https://huggingface.co/datasets/avsolatorio/medi-data-mteb_avs_triplets)
 - Revision: 238a0499b6e6b690cc64ea56fde8461daa8341bb
 
-The dataset contains a `task_type` key which can be used to select only the mteb classification tasks (prefixed with `mteb_`).
+The dataset contains a `task_type` key, which can be used to select only the MTEB classification tasks (prefixed with `mteb_`).
 
 The **MEDI Dataset** is published in the following paper: [One Embedder, Any Task: Instruction-Finetuned Text Embeddings](https://arxiv.org/abs/2212.09741).
 
 The MTEB Benchmark results of the GIST embedding model, compared with the base model, suggest that the fine-tuning dataset has perturbed the model considerably, resulting in significant improvements in certain tasks while degrading performance in others.
 
-The retrieval performance for the TRECCOVID task is of note. The fine-tuning dataset does not contain significant knowledge about COVID, which could have caused the observed performance degradation.
+The retrieval performance for the TRECCOVID task is of note. The fine-tuning dataset does not contain significant knowledge about COVID-19, which could have caused the observed performance degradation. We found some evidence, detailed in the paper, that the thematic coverage of the fine-tuning data can affect downstream performance.
 
 # Usage
 
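As a quick illustration of the `task_type` filtering described in the Data section above, here is a minimal sketch using the Hugging Face `datasets` library. The dataset id, revision, and `task_type` key come from the model card; the `train` split name and the filtering pattern are assumptions.

```python
from datasets import load_dataset

# Load the compiled MEDI + MTEB triplets dataset at the revision noted above.
data = load_dataset(
    "avsolatorio/medi-data-mteb_avs_triplets",
    revision="238a0499b6e6b690cc64ea56fde8461daa8341bb",
    split="train",  # assumed split name
)

# Keep only the mined MTEB classification triplets (task_type prefixed with "mteb_").
mteb_only = data.filter(lambda example: example["task_type"].startswith("mteb_"))
print(len(mteb_only))
```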
@@ -2639,7 +2640,7 @@ The model can be easily loaded using the Sentence Transformers library.
 import torch.nn.functional as F
 from sentence_transformers import SentenceTransformer
 
-revision = None # Replace with the specific revision to ensure reproducibility
+revision = None # Replace with the specific revision to ensure reproducibility if the model is updated.
 
 model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0", revision=revision)
 
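The Usage hunk above only shows the lines that changed. A minimal end-to-end sketch consistent with the imports and loading code shown (the example texts are made up, and the cosine-similarity comparison is just one common way to use the embeddings) could look like:

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

revision = None  # Replace with the specific revision to ensure reproducibility if the model is updated.
model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0", revision=revision)

texts = [
    "Illustrative example of a query.",
    "Illustrative example of a passage to compare against.",
]

# No instruction or prefix is needed; the texts are encoded directly.
embeddings = model.encode(texts, convert_to_tensor=True)

# Cosine similarity between the two embeddings.
score = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
print(score.item())
```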
@@ -2671,13 +2672,29 @@ Checkpoint step = 171000
 Contrastive loss temperature = 0.01
 ```
 
-Specific training details and strategies will be published shortly.
 
 # Evaluation
 
 The model was evaluated using the [MTEB Evaluation](https://huggingface.co/mteb) suite.
 
 
+# Citation
+
+Please cite our work if you use GISTEmbed or the datasets we published in your projects or research. 🤗
+
+```
+@article{solatorio2024gistembed,
+  title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
+  author={Aivin V. Solatorio},
+  journal={arXiv preprint arXiv:2402.16829},
+  year={2024},
+  URL={https://arxiv.org/abs/2402.16829},
+  eprint={2402.16829},
+  archivePrefix={arXiv},
+  primaryClass={cs.LG}
+}
+```
+
 # Acknowledgements
 
 This work is supported by the "KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)" project funded by the [Knowledge for Change Program (KCP)](https://www.worldbank.org/en/programs/knowledge-for-change) of the World Bank - RA-P503405-RESE-TF0C3444.
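For readers unfamiliar with the `Contrastive loss temperature = 0.01` setting listed in the training-parameters hunk, the sketch below shows a generic temperature-scaled, in-batch contrastive (InfoNCE-style) loss. It is illustrative only, not the exact GISTEmbed training objective, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(
    query_emb: torch.Tensor,
    pos_emb: torch.Tensor,
    temperature: float = 0.01,
) -> torch.Tensor:
    """Generic temperature-scaled contrastive loss with in-batch negatives (illustrative only)."""
    query_emb = F.normalize(query_emb, dim=-1)
    pos_emb = F.normalize(pos_emb, dim=-1)
    # Cosine similarities between every query and every positive in the batch, scaled by the temperature.
    logits = query_emb @ pos_emb.T / temperature
    # The matching pair sits on the diagonal; all other pairs act as negatives.
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, targets)
```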