Model Details

  • Base Model: laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k
  • Architecture: Vision Transformer (ViT-H/14) + XLM-Roberta Large text encoder
  • Languages: Multilingual, with a focus on Ukrainian
  • Developed by: Yurii Laba and Volodymyr Mudriy, in affiliation with the Ukrainian Catholic University

This is an OpenCLIP model fine-tuned on synonym-augmented captions to improve embedding stability for Ukrainian text-to-image retrieval under synonym substitution.

Data & Training

Fine-tuning was performed on the Multi30K-Ukrainian training set, extended with synonym-augmented captions, and optimized with the standard CLIP contrastive loss; a sketch of both the augmentation and the loss follows the list below.

  • Augmentation: Each caption was expanded into several variants by substituting exactly one noun with a context-aware synonym.
  • Synonym Generation: Synonyms were produced using GPT-4o, ensuring semantic, morphological, and grammatical correctness.
  • Images: The paired image remained unchanged.
  • Final Corpus: Original Multi30K-Ukrainian training set + synonym-augmented captions.
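
The snippet below sketches both steps. It is a minimal illustration under stated assumptions, not the authors' exact pipeline: the GPT-4o prompt wording and the helper names (synonym_variant, clip_contrastive_loss) are ours, and the loss shown is the standard symmetric CLIP objective over L2-normalized, index-paired features.

import torch
import torch.nn.functional as F
from openai import OpenAI

client = OpenAI()

def synonym_variant(caption: str) -> str:
    """Ask GPT-4o to swap exactly one noun for a context-aware synonym.
    The prompt wording is an illustrative assumption."""
    prompt = (
        "Replace exactly one noun in the following Ukrainian sentence with a "
        "context-appropriate synonym, keeping the grammar and morphology "
        f"correct. Return only the new sentence.\n\n{caption}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def clip_contrastive_loss(image_feats, text_feats, logit_scale):
    """Symmetric CLIP (InfoNCE) loss; image_feats[i] pairs with text_feats[i],
    both L2-normalized."""
    logits = logit_scale * image_feats @ text_feats.T
    labels = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2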

Evaluation: Ukrainian Text-to-Image Retrieval (Multi30K-Ukrainian Test Set)

Model      | Unpert. (UA)  | Unpert. (ENG) | SSA-Dict      | SSA-GPT-4o    | SSA-Hybrid
---------- | ------------- | ------------- | ------------- | ------------- | -------------
OpenCLIP   | 32.1 / 54.3   | 41.6 / 65.7   | 7.6 / 39.3    | 10.9 / 44.0   | 16.8 / 49.0
Synonym FT | 39.07 / 63.76 | 45.77 / 69.79 | 19.78 / 51.57 | 25.14 / 56.36 | 28.08 / 58.94

We evaluated the model on the Multi30K-Ukrainian test set, comparing baseline OpenCLIP with our synonym fine-tuned variant. Performance is reported as HIT@1 / HIT@5 (higher is better): the proportion of queries for which the correct image is retrieved within the top-1 and top-5 results. The evaluation conditions are listed below, followed by a sketch of the metric computation.

  • Unperturbed (UA): Original Ukrainian captions.
  • Unperturbed (ENG): Original English captions (baseline).
  • SSA-Dict: Synonym Substitution Attack using dictionary-based synonyms.
  • SSA-GPT-4o: Synonym Substitution Attack using GPT-4o-generated synonyms.
  • SSA-Hybrid: Mixed attack combining dictionary and GPT-4o synonyms.
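
To make the metric concrete, here is a minimal sketch of HIT@k, assuming L2-normalized features and exactly one paired image per caption (the function name is ours):

import torch

def hit_at_k(text_feats, image_feats, k=5):
    # text_feats[i] is paired with image_feats[i]; both are L2-normalized
    sims = text_feats @ image_feats.T    # (N, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices  # top-k image indices per caption
    targets = torch.arange(sims.shape[0], device=sims.device).unsqueeze(1)
    return (topk == targets).any(dim=-1).float().mean().item()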

Usage Example

To use this model:

  1. Download the checkpoint:
    ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k.pt
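
    If the checkpoint is published in this Hub repository, it can also be fetched programmatically; a sketch using huggingface_hub (the file is cached locally after the first call):

from huggingface_hub import hf_hub_download

# Download the fine-tuned checkpoint and get its local path
pretrained_path = hf_hub_download(
    repo_id="lang-uk/ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k",
    filename="ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k.pt",
)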

  2. Load it in OpenCLIP:

import torch
from PIL import Image
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

pretrained_path = "ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k.pt"

# Load model & preprocessing
model, _, preprocess = open_clip.create_model_and_transforms('xlm-roberta-large-ViT-H-14', pretrained=pretrained_path)
model.to(device)
model.eval()
tokenizer = open_clip.get_tokenizer('xlm-roberta-large-ViT-H-14')

# Example inputs
image = preprocess(Image.open("ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/dog.jpg")).unsqueeze(0)
text = tokenizer(["діаграма", "собака", "кіт"])  # "a diagram", "a dog", "a cat"

# Encode & normalize (autocast follows the selected device so the example
# also runs on CPU-only machines)
with torch.no_grad(), torch.autocast(device):
    image_features = model.encode_image(image.to(device))
    text_features = model.encode_text(text.to(device))
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Compute similarity
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)

Citation

If you use this model in your work, please cite: TODO: will be added after EMNLP 2025
