---
base_model: Qwen/Qwen3-VL-2B-Instruct
library_name: transformers
model_name: Qwen3-VL-2B-catmus-medieval
tags:
- generated_from_trainer
- sft
- trl
- vision-language
- ocr
- transcription
- medieval
- latin
- manuscript
licence: license
datasets:
- CATMuS/medieval
---

# Model Card for Qwen3-VL-2B-catmus-medieval

This model is a fine-tuned version of [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) for transcribing medieval manuscript lines from images. It was trained with [TRL](https://github.com/huggingface/trl) on the [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval) dataset.

## Model Description

This vision-language model specializes in transcribing text from line-level images of medieval manuscripts. Given an image of a manuscript line, the model generates the corresponding transcription.

## Performance

The model was evaluated on 100 examples from the test split of [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval).

### Metrics

| Metric | Base Model | Fine-tuned Model | Relative Improvement |
|--------|-----------|------------------|----------------------|
| **Character Error Rate (CER)** | 1.0815 (108.15%) | 0.2779 (27.79%) | **74.30%** |
| **Word Error Rate (WER)** | 1.7386 (173.86%) | 0.7043 (70.43%) | **59.49%** |

### Sample Predictions

Here are some example transcriptions comparing the base model and the fine-tuned model:

**Example 1:**
- **Reference:** paulꝯ ad thessalonicenses .iii.
- **Base Model:** Paulus ad the Malomancis · iii.
- **Fine-tuned Model:** Paulꝰ ad thessalonensis .iii.

**Example 2:**
- **Reference:** acceptad mi humilde seruicio. e dissipad. e plantad en el
- **Base Model:** acceptad mi humilde servicio, e dissipad, e plantad en el
- **Fine-tuned Model:** acceptad mi humilde seruicio, e dissipad, e plantad en el

**Example 3:**
- **Reference:** ꝙ mattheus illam dictionem ponat
- **Base Model:** p mattheus illam dictoneum proa
- **Fine-tuned Model:** ꝑ mattheus illam dictione in ponat

**Example 4:**
- **Reference:** Elige ꝗd uoueas. eadẽ ħ ꝗꝗ sama ferebat.
- **Base Model:** f. ligeq d uonear. eade h q q fama ferebat.
- **Fine-tuned Model:** f liges ꝗd uonear. eadẽ li ꝗq tanta ferebat᷑.

**Example 5:**
- **Reference:** a prima coniugatione ue
- **Base Model:** Grigimacopissagazione-ve
- **Fine-tuned Model:** a ꝑrũt̾tacõnueꝰatione. ne
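The CER and WER figures above are standard edit-distance metrics over characters and words respectively. A minimal sketch of how such scores can be computed, assuming the `jiwer` package (the card does not state which implementation was used for the evaluation):

```python
# Hedged sketch: reproduces the style of CER/WER evaluation described above.
# Assumes the `jiwer` library; the actual evaluation tooling is not specified.
from jiwer import cer, wer

# Two of the sample predictions from this card, for illustration only
references = [
    "paulꝯ ad thessalonicenses .iii.",
    "acceptad mi humilde seruicio. e dissipad. e plantad en el",
]
predictions = [
    "Paulꝰ ad thessalonensis .iii.",
    "acceptad mi humilde seruicio, e dissipad, e plantad en el",
]

# jiwer aggregates edit distances over the whole list of reference/prediction pairs
print(f"CER: {cer(references, predictions):.4f}")
print(f"WER: {wer(references, predictions):.4f}")
```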
## Quick start

```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from peft import PeftModel
from PIL import Image

# Load model and processor
base_model = "Qwen/Qwen3-VL-2B-Instruct"
adapter_model = "small-models-for-glam/Qwen3-VL-2B-catmus"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    base_model, torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(base_model)

# Load your image
image = Image.open("path/to/your/manuscript_image.jpg")

# Prepare the message
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Transcribe the text shown in this image."},
        ],
    },
]

# Generate transcription
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
transcription = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(transcription)
```

## Use Cases

This model is designed for:
- Transcribing line-level images of medieval manuscripts
- Digitizing historical manuscripts
- Supporting historical research and archival work
- Optical character recognition (OCR) for specialized historical texts

## Training procedure

This model was fine-tuned using Supervised Fine-Tuning (SFT) with LoRA adapters on the Qwen3-VL-2B-Instruct base model.

### Training Data

The model was trained on [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval), a dataset of line-level medieval manuscript images paired with text transcriptions.

### Training Configuration

- **Base Model**: Qwen/Qwen3-VL-2B-Instruct
- **Training Method**: Supervised Fine-Tuning (SFT) with LoRA
- **LoRA Configuration**:
  - Rank (r): 16
  - Alpha: 32
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  - Dropout: 0.1
- **Training Arguments**:
  - Epochs: 3
  - Batch size per device: 2
  - Gradient accumulation steps: 4
  - Learning rate: 5e-05
  - Optimizer: AdamW
  - Mixed precision: FP16

### Framework versions

- TRL: 0.23.0
- Transformers: 4.57.1
- PyTorch: 2.8.0
- Datasets: 4.1.1
- Tokenizers: 0.22.1

## Limitations

- The model is specialized for line-level medieval manuscripts and may not perform well on other types of text or images
- Performance may vary with image quality, resolution, and handwriting style
- The model was trained on a single dataset and may require further fine-tuning for other manuscript collections

## Citations

If you use this model, please cite the base model and training framework:

### Qwen3-VL

```bibtex
@article{Qwen3-VL,
    title={Qwen3-VL: Large Vision Language Models Pretrained on Massive Data},
    author={Qwen Team},
    journal={arXiv preprint},
    year={2024}
}
```

### TRL (Transformer Reinforcement Learning)

```bibtex
@misc{vonwerra2022trl,
    title = {{TRL: Transformer Reinforcement Learning}},
    author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year = 2020,
    journal = {GitHub repository},
    publisher = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```

---

*README generated automatically on 2025-10-24 10:49:05*