Model Details
- Base Model: laion/CLIP-ViT-H-14-laion2B-s32B-b79K
- Architecture: Vision Transformer (ViT-H/14) + XLM-RoBERTa Large text encoder
- Languages: Multilingual, with a focus on Ukrainian
- Developed by: Yurii Laba and Volodymyr Mudriy, in affiliation with the Ukrainian Catholic University
This is an OpenCLIP model fine-tuned on synonym-augmented captions to improve embedding stability in text-to-image retrieval under synonym substitution.
Data & Training
Fine-tuning was performed on the Multi30K-Ukrainian training set, extended with synonym-augmented captions, and optimized with a CLIP contrastive loss.
- Augmentation: Each caption was expanded into several variants by substituting exactly one noun with a context-aware synonym (a minimal sketch of this step follows the list below).
- Synonym Generation: Synonyms were produced using GPT-4o, ensuring semantic, morphological, and grammatical correctness.
- Images: The paired image remained unchanged.
- Final Corpus: Original Multi30K-Ukrainian training set + synonym-augmented captions.
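The augmentation step can be sketched as follows. This is a minimal illustration rather than the exact training pipeline: the `augment_caption` helper and the prompt wording are assumptions; only the use of GPT-4o for single-noun, context-aware substitution is taken from the description above.

```python
# Hypothetical sketch of the caption-augmentation step (not the authors' exact script).
# GPT-4o is asked to replace exactly one noun in a Ukrainian caption with a
# context-appropriate synonym; the paired image stays unchanged.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def augment_caption(caption: str, n_variants: int = 3) -> list[str]:
    """Return up to n_variants rewrites of `caption`, each with one noun replaced."""
    prompt = (
        "Rewrite the following Ukrainian caption, replacing exactly one noun with a "
        "context-appropriate synonym. Keep the sentence grammatically and morphologically "
        f"correct. Return {n_variants} variants, one per line.\n\nCaption: {caption}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return [line.strip() for line in response.choices[0].message.content.splitlines() if line.strip()]

# The fine-tuning corpus is then the original (image, caption) pairs plus
# (image, augmented_caption) pairs, optimized with the standard CLIP contrastive loss.
```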
Evaluation: Ukrainian Text-to-Image Retrieval (Multi30K-Ukrainian Test Set)
| Model | Unpert. (UA) | Unpert. (ENG) | SSA-Dict | SSA-GPT-4o | SSA-Hybrid |
|---|---|---|---|---|---|
| OpenCLIP | 32.1 / 54.3 | 41.6 / 65.7 | 7.6 / 39.3 | 10.9 / 44.0 | 16.8 / 49.0 |
| Synonym FT | 39.07 / 63.76 | 45.77 / 69.79 | 19.78 / 51.57 | 25.14 / 56.36 | 28.08 / 58.94 |
We evaluated the model on the Multi30K-Ukrainian test set, comparing the baseline OpenCLIP model with our synonym-fine-tuned variant. Performance is reported as HIT@1 / HIT@5 (higher is better), i.e. the proportion of queries for which the correct image is retrieved within the top-1 and top-5 results (a minimal HIT@K sketch follows the list below).
- Unperturbed (UA): Original Ukrainian captions.
- Unperturbed (ENG): Original English captions (baseline).
- SSA-Dict: Synonym Substitution Attack using dictionary-based synonyms.
- SSA-GPT-4o: Synonym Substitution Attack using GPT-4o-generated synonyms.
- SSA-Hybrid: Mixed attack combining dictionary and GPT-4o synonyms.
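For reference, the HIT@K values in the table can be computed as in the sketch below. This is a minimal illustration, not the exact evaluation script, and it assumes caption `i` in the test set is paired with image `i` and that both embedding matrices are L2-normalized.

```python
import torch

def hit_at_k(text_features: torch.Tensor, image_features: torch.Tensor, k: int) -> float:
    """HIT@K for text-to-image retrieval.

    Both inputs are (N, D) L2-normalized embeddings; caption i is paired with image i.
    """
    sims = text_features @ image_features.T     # (N, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices         # top-K image indices per caption
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

# HIT@1 / HIT@5 as reported above:
# hit_at_k(text_feats, image_feats, k=1), hit_at_k(text_feats, image_feats, k=5)
```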
Usage Example
To use this model:
1. Download the checkpoint: `ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k.pt`
2. Load it in OpenCLIP:
```python
import torch
from PIL import Image
import open_clip
device = "cuda" if torch.cuda.is_available() else "cpu"
pretrained_path = "ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k.pt"
# Load model & preprocessing
model, _, preprocess = open_clip.create_model_and_transforms('xlm-roberta-large-ViT-H-14', pretrained=pretrained_path)
model.to(device)
model.eval()
tokenizer = open_clip.get_tokenizer('xlm-roberta-large-ViT-H-14')
# Example inputs
image = preprocess(Image.open("ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/dog.jpg")).unsqueeze(0)
text = tokenizer(["діаграма", "собака", "кіт"])
# Encode & normalize
# autocast follows the active device; falls back to CPU autocast when CUDA is unavailable
with torch.no_grad(), torch.autocast(device_type=device):
    image_features = model.encode_image(image.to(device))
    text_features = model.encode_text(text.to(device))
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
# Compute similarity
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
```
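The snippet above performs zero-shot classification over three candidate captions. Since the model is intended for text-to-image retrieval, the same encoders can also rank a pool of images against a Ukrainian query, as in this sketch (it reuses `model`, `preprocess`, and `tokenizer` from above; the image paths and query text are placeholders):

```python
# Text-to-image retrieval: rank a pool of images for a single Ukrainian query.
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]  # placeholder paths
images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
query = tokenizer(["собака грається у парку"]).to(device)  # "a dog playing in the park"

with torch.no_grad():
    img_feats = model.encode_image(images)
    txt_feats = model.encode_text(query)
    img_feats /= img_feats.norm(dim=-1, keepdim=True)
    txt_feats /= txt_feats.norm(dim=-1, keepdim=True)

# Higher cosine similarity = better match; print paths from best to worst.
ranking = (txt_feats @ img_feats.T).squeeze(0).argsort(descending=True)
print("Images ranked by similarity:", [image_paths[i] for i in ranking.tolist()])
```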
Citation
If you use this model in your work, please cite our paper (TODO: citation will be added after EMNLP 2025).