Model Details
- Base Model: laion/CLIP-ViT-H-14-laion2B-s32B-b79K
- Architecture: Vision Transformer (ViT-H/14) + XLM-RoBERTa Large text encoder
- Languages: Multilingual, with a focus on Ukrainian
- Developed by: Yurii Laba and Volodymyr Mudriy, in affiliation with the Ukrainian Catholic University
This is an OpenCLIP model fine-tuned on synonym-augmented captions to improve embedding stability in text-to-image retrieval under synonym substitution.
Data & Training
Fine-tuning was performed on the Multi30K-Ukrainian training set, extended with synonym-augmented captions, and optimized with a CLIP contrastive loss.
- Augmentation: Each caption was expanded into several variants by substituting exactly one noun with a context-aware synonym (a minimal sketch of this step follows the list below).
- Synonym Generation: Synonyms were produced using GPT-4o, ensuring semantic, morphological, and grammatical correctness.
- Images: The paired image remained unchanged.
- Final Corpus: Original Multi30K-Ukrainian training set + synonym-augmented captions.
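The augmentation step can be sketched as follows. This is a minimal illustration rather than the exact training pipeline: the `augment_caption` helper and the prompt wording are assumptions; only the use of GPT-4o for single-noun, context-aware substitution is taken from the description above.

```python
# Hypothetical sketch of the caption-augmentation step (not the authors' exact script).
# GPT-4o is asked to replace exactly one noun in a Ukrainian caption with a
# context-appropriate synonym; the paired image stays unchanged.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def augment_caption(caption: str, n_variants: int = 3) -> list[str]:
    """Return up to n_variants rewrites of `caption`, each with one noun replaced."""
    prompt = (
        "Rewrite the following Ukrainian caption, replacing exactly one noun with a "
        "context-appropriate synonym. Keep the sentence grammatically and morphologically "
        f"correct. Return {n_variants} variants, one per line.\n\nCaption: {caption}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return [line.strip() for line in response.choices[0].message.content.splitlines() if line.strip()]

# The fine-tuning corpus is then the original (image, caption) pairs plus
# (image, augmented_caption) pairs, optimized with the standard CLIP contrastive loss.
```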
Evaluation: Ukrainian Text-to-Image Retrieval (Multi30K-Ukrainian Test Set)
| Model | Unpert. (UA) | Unpert. (ENG) | SSA-Dict | SSA-GPT-4o | SSA-Hybrid |
|---|---|---|---|---|---|
| OpenCLIP | 32.1 / 54.3 | 41.6 / 65.7 | 7.6 / 39.3 | 10.9 / 44.0 | 16.8 / 49.0 |
| Synonym FT | 39.07 / 63.76 | 45.77 / 69.79 | 19.78 / 51.57 | 25.14 / 56.36 | 28.08 / 58.94 |
We evaluated the model on the Multi30K-Ukrainian test set, comparing the baseline OpenCLIP model with our synonym-fine-tuned variant. Performance is reported as HIT@1 / HIT@5 (higher is better), i.e. the proportion of queries for which the correct image is retrieved within the top-1 and top-5 results (a minimal HIT@K sketch follows the list below).
- Unperturbed (UA): Original Ukrainian captions.
- Unperturbed (ENG): Original English captions (baseline).
- SSA-Dict: Synonym Substitution Attack using dictionary-based synonyms.
- SSA-GPT-4o: Synonym Substitution Attack using GPT-4o-generated synonyms.
- SSA-Hybrid: Mixed attack combining dictionary and GPT-4o synonyms.
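For reference, the HIT@K values in the table can be computed as in the sketch below. This is a minimal illustration, not the exact evaluation script, and it assumes caption `i` in the test set is paired with image `i` and that both embedding matrices are L2-normalized.

```python
import torch

def hit_at_k(text_features: torch.Tensor, image_features: torch.Tensor, k: int) -> float:
    """HIT@K for text-to-image retrieval.

    Both inputs are (N, D) L2-normalized embeddings; caption i is paired with image i.
    """
    sims = text_features @ image_features.T     # (N, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices         # top-K image indices per caption
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

# HIT@1 / HIT@5 as reported above:
# hit_at_k(text_feats, image_feats, k=1), hit_at_k(text_feats, image_feats, k=5)
```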
Usage Example
To use this model:
1. Download the checkpoint: `ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k.pt`
2. Load it in OpenCLIP:
```python
import torch
from PIL import Image
import open_clip
device = "cuda" if torch.cuda.is_available() else "cpu"
pretrained_path = "ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k.pt"
# Load model & preprocessing
model, _, preprocess = open_clip.create_model_and_transforms('xlm-roberta-large-ViT-H-14', pretrained=pretrained_path)
model.to(device)
model.eval()
tokenizer = open_clip.get_tokenizer('xlm-roberta-large-ViT-H-14')
# Example inputs
image = preprocess(Image.open("ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/dog.jpg")).unsqueeze(0)
text = tokenizer(["діаграма", "собака", "кіт"])
# Encode & normalize
# autocast follows the active device; falls back to CPU autocast when CUDA is unavailable
with torch.no_grad(), torch.autocast(device_type=device):
    image_features = model.encode_image(image.to(device))
    text_features = model.encode_text(text.to(device))
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
# Compute similarity
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
```
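The snippet above performs zero-shot classification over three candidate captions. Since the model is intended for text-to-image retrieval, the same encoders can also rank a pool of images against a Ukrainian query, as in this sketch (it reuses `model`, `preprocess`, and `tokenizer` from above; the image paths and query text are placeholders):

```python
# Text-to-image retrieval: rank a pool of images for a single Ukrainian query.
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]  # placeholder paths
images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
query = tokenizer(["собака грається у парку"]).to(device)  # "a dog playing in the park"

with torch.no_grad():
    img_feats = model.encode_image(images)
    txt_feats = model.encode_text(query)
    img_feats /= img_feats.norm(dim=-1, keepdim=True)
    txt_feats /= txt_feats.norm(dim=-1, keepdim=True)

# Higher cosine similarity = better match; print paths from best to worst.
ranking = (txt_feats @ img_feats.T).squeeze(0).argsort(descending=True)
print("Images ranked by similarity:", [image_paths[i] for i in ranking.tolist()])
```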
Citation
If you use this model in your work, please cite our paper (TODO: citation will be added after EMNLP 2025).