---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

📖 Paper | 💻 Code | 🤗 CapRL-3B Model | 🤗 CapRL-2M Dataset | 🤗 CapRL Collection

**CapRL-Eval-3B** is the model used to answer questions based on captions; it is a fine-tuned version of Qwen2.5-VL-3B. On tasks that are not multiple-choice, such as ChartQA, it produces more stable output formatting.

## Introduction

We are excited to introduce CapRL-3B, a lightweight 3B image captioner that achieves perception capabilities comparable to Qwen2.5-VL-72B. This is the first study to apply Reinforcement Learning with Verifiable Rewards (RLVR) to the open-ended and subjective task of image captioning. Unlike traditional Supervised Fine-Tuning, which can lead models to memorize a limited set of annotated captions, our method lets the model explore and generate a broader range of creative and general descriptions.

CapRL is a new training paradigm built on a decoupled two-stage pipeline. The first stage uses LVLMs to generate rich and accurate captions. The second stage then evaluates caption quality by having a language-only LLM answer questions about the image using nothing but the generated caption. We also built a dedicated QA curation pipeline to ensure the quality of the questions and answers used in the second stage. By applying the CapRL training framework, initializing from Qwen2.5-VL-3B, and training on a carefully filtered 75K QA dataset, we obtained a highly capable captioner, CapRL-3B.
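To make the two-stage reward concrete, below is a minimal, illustrative sketch (not the official training code): a text-only judge answers curated multiple-choice questions from the generated caption alone, and its accuracy serves as the verifiable reward. The names `QAPair`, `caption_reward`, and `judge` are placeholders of ours, not CapRL APIs.

```python
# Illustrative sketch of a CapRL-style verifiable reward (not the official implementation).
# A language-only judge answers curated questions using only the caption; reward = accuracy.
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    choices: list[str]
    answer: str  # ground-truth choice letter, e.g. "B"


def caption_reward(caption: str, qa_pairs: list[QAPair], judge) -> float:
    """Score a caption by how well a text-only judge can answer questions from it."""
    correct = 0
    for qa in qa_pairs:
        prompt = (
            f"Caption: {caption}\n"
            f"Question: {qa.question}\n"
            f"Choices: {', '.join(qa.choices)}\n"
            "Answer with the letter of the correct choice."
        )
        prediction = judge(prompt)  # language-only LLM; it never sees the image
        correct += int(prediction.strip().upper().startswith(qa.answer))
    return correct / max(len(qa_pairs), 1)


if __name__ == "__main__":
    # Toy usage with a trivial stand-in judge.
    qa = [QAPair("What color is the bar for 2021?", ["A. red", "B. blue"], "B")]
    dummy_judge = lambda prompt: "B"
    print(caption_reward("A bar chart where the 2021 bar is blue.", qa, dummy_judge))
```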

*Figure: main results.*

## Key Features

* **Remarkable visual understanding of charts, infographics, and documents**: CapRL-3B achieves perception accuracy and visual information coverage comparable to Qwen2.5-VL-72B.
* **Well-organized output**: CapRL-3B's outputs are well-structured, making them clear and easy to read.
* **Detailed descriptions of natural images**: CapRL-3B's captions cover the valid visual information in an image while containing fewer hallucinations.

## Usage

If you want to use **CapRL-3B** for captioning, you can directly follow the same inference approach as the [Qwen2.5-VL series](https://github.com/QwenLM/Qwen3-VL/tree/d2240f11656bfe404b9ba56db4e51cd09f522ff1). We recommend using **vLLM** to speed up inference.

### Start an OpenAI API Service

Run the command below to start an OpenAI-compatible API service:

```bash
vllm serve "/PATH/CapRL-3B" \
  --trust-remote-code \
  --tensor-parallel-size=1 \
  --pipeline-parallel-size=1 \
  --gpu-memory-utilization=0.95 \
  --served-model-name=caprl \
  --port 8000 \
  --host 0.0.0.0
```

Then you can use the chat API as below (see the [OpenAI API protocol document](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details):

```python
import base64
from openai import OpenAI

# Point the OpenAI client at vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Encode a local image as a base64 data URL.
image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_qwen = f"data:image;base64,{encoded_image_text}"

chat_response = client.chat.completions.create(
    model="caprl",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": base64_qwen},
                },
                {"type": "text", "text": "What is the text in the illustration?"},
            ],
        },
    ],
    temperature=1.0,
    max_tokens=2048,  # adjust as needed
    top_p=1.0,
    extra_body={
        "repetition_penalty": 1.0,
    },
)
print("Chat response:", chat_response)
```
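If you prefer to run the model locally with 🤗 Transformers instead of serving it through vLLM, the snippet below is a minimal sketch following the standard Qwen2.5-VL inference pattern. It assumes a recent `transformers` release with Qwen2.5-VL support and the `qwen-vl-utils` package; the model path, image path, and captioning prompt are placeholders.

```python
# Minimal local-inference sketch following the standard Qwen2.5-VL usage pattern.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "/PATH/CapRL-3B"  # placeholder: local checkpoint path
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/local/image.png"},
            {"type": "text", "text": "Describe this image in detail."},  # example prompt
        ],
    }
]

# Build the chat-formatted prompt and extract the image inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate the caption and strip the prompt tokens before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0])
```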

## Cases

*Figures: qualitative cases.*