---
language:
- en
license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:10000
- loss:MultipleNegativesRankingLoss
base_model: google/siglip-base-patch16-512
widget:
- source_sentence: A man standing next to a little girl riding a horse.
  sentences:
  - The woman is working on her computer at the desk.
  - A young man holding an umbrella next to a herd of cattle.
  - 'a person sitting at a desk with a keyboard and monitor '
- source_sentence: 'A car at an intersection while a man is crossing the street. '
  sentences:
  - A plane that is flying in the air.
  - a small girl sitting on a chair holding a white bear
  - A young toddler walks across the grass in a park.
- source_sentence: A lady riding her bicycle on the side of a street.
  sentences:
  - Flowers hang from a small decorative post in a yard.
  - Flowers in a clear vase sitting on a table.
  - The toilet is near the door in the bathroom.
- source_sentence: 'A group of zebras standing beside each other in the desert. '
  sentences:
  - The bathroom is clean and ready for us to use.
  - A woman throwing a frisbee as a child looks on.
  - a bird with a pink eye is sitting on a branch in the woods.
- source_sentence: A large desk by a window is neatly arranged.
  sentences:
  - An old toilet sits in dirt with a helmet on top.
  - A lady sitting at an enormous dining table with lots of food.
  - A long hot dog on a plate on a table.
datasets:
- jxie/coco_captions
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
co2_eq_emissions:
  emissions: 14.565152777100327
  energy_consumed: 0.054424347688532056
  source: codecarbon
  training_type: fine-tuning
  on_cloud: false
  cpu_model: 13th Gen Intel(R) Core(TM) i7-13700K
  ram_total_size: 31.777088165283203
  hours_used: 0.169
  hardware_used: 1 x NVIDIA GeForce RTX 3090
model-index:
- name: Google SigLIP (512x512 resolution) model trained on COCO Captions
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: coco eval
      type: coco-eval
    metrics:
    - type: cosine_accuracy@1
      value: 0.755
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.944
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.975
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 0.992
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.755
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.31466666666666665
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.19500000000000003
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.09920000000000001
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.755
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.944
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.975
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.992
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.8860228540949219
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.8505285714285713
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.8508208051006964
      name: Cosine Map@100
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: coco test
      type: coco-test
    metrics:
    - type: cosine_accuracy@1
      value: 0.754
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.935
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.976
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 0.992
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.754
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.31166666666666665
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.1952
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.09920000000000001
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.754
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.935
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.976
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.992
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.8848518154761025
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.8490460317460323
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.849432976701497
      name: Cosine Map@100
---
# Google SigLIP (512x512 resolution) model trained on COCO Captions
This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [google/siglip-base-patch16-512](https://huggingface.co/google/siglip-base-patch16-512) on the [coco_captions](https://huggingface.co/datasets/jxie/coco_captions) dataset. It maps text and images to a shared dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [google/siglip-base-patch16-512](https://huggingface.co/google/siglip-base-patch16-512) 
- **Maximum Sequence Length:** Not specified
- **Output Dimensionality:** Not specified
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
    - [coco_captions](https://huggingface.co/datasets/jxie/coco_captions)
- **Language:** en
- **License:** apache-2.0
### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
### Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'get_text_features', 'method_output_name': None}, 'image': {'method': 'get_image_features', 'method_output_name': None}}, 'module_output_name': 'sentence_embedding', 'architecture': 'SiglipModel'})
)
```
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("tomaarsen/google-siglip-base-coco")
# Run inference
sentences = [
    'A large desk by a window is neatly arranged.',
    'A long hot dog on a plate on a table.',
    'A lady sitting at an enormous dining table with lots of food.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, D), where D is the model's embedding dimension
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.1848, 0.1578],
#         [0.1848, 1.0000, 0.5058],
#         [0.1578, 0.5058, 1.0000]])
```
## Evaluation
### Metrics
#### Information Retrieval
* Datasets: `coco-eval` and `coco-test`
* Evaluated with [InformationRetrievalEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
| Metric              | coco-eval | coco-test  |
|:--------------------|:----------|:-----------|
| cosine_accuracy@1   | 0.755     | 0.754      |
| cosine_accuracy@3   | 0.944     | 0.935      |
| cosine_accuracy@5   | 0.975     | 0.976      |
| cosine_accuracy@10  | 0.992     | 0.992      |
| cosine_precision@1  | 0.755     | 0.754      |
| cosine_precision@3  | 0.3147    | 0.3117     |
| cosine_precision@5  | 0.195     | 0.1952     |
| cosine_precision@10 | 0.0992    | 0.0992     |
| cosine_recall@1     | 0.755     | 0.754      |
| cosine_recall@3     | 0.944     | 0.935      |
| cosine_recall@5     | 0.975     | 0.976      |
| cosine_recall@10    | 0.992     | 0.992      |
| **cosine_ndcg@10**  | **0.886** | **0.8849** |
| cosine_mrr@10       | 0.8505    | 0.849      |
| cosine_map@100      | 0.8508    | 0.8494     |
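
The precision, recall, and accuracy rows above are consistent with each query having exactly one relevant document: recall@k equals accuracy@k, and precision@k equals accuracy@k divided by k (for example, 0.944 / 3 ≈ 0.3147). The following is a minimal sketch of that relationship, not the evaluator's implementation. The function name `ir_metrics` and the example ranks are hypothetical, and the single-relevant-document setup is an assumption.

```python
import math

def ir_metrics(ranks, k):
    """Retrieval metrics at cutoff k.

    `ranks` holds, per query, the 1-based rank of its single relevant
    document (None if it was never retrieved). Assumes exactly one
    relevant document per query.
    """
    n = len(ranks)
    hits = [r is not None and r <= k for r in ranks]
    accuracy = sum(hits) / n   # relevant document appears in the top k
    recall = accuracy          # one relevant document, so the two coincide
    precision = accuracy / k   # at most 1 of the k retrieved items is relevant
    mrr = sum(1.0 / r for r, h in zip(ranks, hits) if h) / n
    # With one relevant document the ideal DCG is 1, so NDCG equals DCG.
    ndcg = sum(1.0 / math.log2(r + 1) for r, h in zip(ranks, hits) if h) / n
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "mrr": mrr, "ndcg": ndcg}

# Hypothetical ranks for four queries (not taken from the evaluation run).
ranks = [1, 2, 1, 4]
m = ir_metrics(ranks, k=3)
print(m["accuracy"], m["precision"], m["mrr"])
# 0.75 0.25 0.625
```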
## Training Details
### Training Dataset
#### coco_captions
* Dataset: [coco_captions](https://huggingface.co/datasets/jxie/coco_captions) at [a2ed90d](https://huggingface.co/datasets/jxie/coco_captions/tree/a2ed90d49b61dd13dd71f399c70f5feb897f8bec)
* Size: 10,000 training samples
* Columns: image and caption
* Approximate statistics based on the first 1000 samples:
  |      | image                             | caption |
  |:-----|:----------------------------------|:--------|
  | type | PIL.JpegImagePlugin.JpegImageFile | string  |
* Sample captions:
  | caption                                                |
  |:-------------------------------------------------------|
  | A woman wearing a net on her head cutting a cake.      |
  | A woman cutting a large white sheet cake.              |
  | A woman wearing a hair net cutting a large sheet cake. |
* Loss: [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim",
      "gather_across_devices": false
  }
  ```
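
To illustrate what the `scale` and `cos_sim` parameters do, here is a dependency-free sketch of an in-batch-negatives ranking loss: each anchor is scored against every candidate in the batch by cosine similarity, the scores are multiplied by `scale`, and cross-entropy is taken with the matching candidate as the target. This is an approximation under stated assumptions, not the library's code; the helper names `cos_sim` and `mnr_loss` and the toy 2-D vectors are hypothetical.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over scaled cosine similarities.

    positives[i] is the correct match for anchors[i]; every other
    entry of `positives` serves as an in-batch negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)

# Toy embeddings (hypothetical): orthogonal unit vectors.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each anchor matches its positive
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each anchor matches the wrong one

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
print(loss_aligned < 1e-6, round(loss_swapped, 3))
# True 20.0
```

A larger `scale` sharpens the softmax, so a mismatch between an anchor and its positive is penalized more heavily.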
### Evaluation Dataset
#### coco_captions
* Dataset: [coco_captions](https://huggingface.co/datasets/jxie/coco_captions) at [a2ed90d](https://huggingface.co/datasets/jxie/coco_captions/tree/a2ed90d49b61dd13dd71f399c70f5feb897f8bec)
* Size: 1,000 evaluation samples
* Columns: image and caption
* Approximate statistics based on the first 1000 samples:
  |      | image                             | caption |
  |:-----|:----------------------------------|:--------|
  | type | PIL.JpegImagePlugin.JpegImageFile | string  |
* Sample captions:
  | caption                                                             |
  |:--------------------------------------------------------------------|
  | A child holding a flowered umbrella and petting a yak.              |
  | A young man holding an umbrella next to a herd of cattle.           |
  | a young boy barefoot holding an umbrella touching the horn of a cow |
* Loss: [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim",
      "gather_across_devices": false
  }
  ```
### Training Hyperparameters
#### Non-Default Hyperparameters
- `eval_strategy`: steps
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `learning_rate`: 2e-05
- `num_train_epochs`: 1
- `warmup_ratio`: 0.1
- `bf16`: True
- `batch_sampler`: no_duplicates
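
As a rough worked example, the settings above imply the following step counts for the 10,000-sample training set. The figures assume a single device (the card lists 1 x RTX 3090), no gradient accumulation, and that the warmup step count is rounded up. The `no_duplicates` batch sampler can change the actual number of batches, so treat the result as an estimate.

```python
import math

num_samples = 10_000   # training set size stated in this card
batch_size = 16        # per_device_train_batch_size
num_epochs = 1         # num_train_epochs
warmup_ratio = 0.1     # warmup_ratio
num_devices = 1        # assumption: single GPU

steps_per_epoch = math.ceil(num_samples / (batch_size * num_devices))
total_steps = steps_per_epoch * num_epochs
# Assumption: warmup steps are the ratio of total steps, rounded up.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
# 625 625 63
```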
#### All Hyperparameters