
Yuan 3.0 Multimodal Foundation Model



Latest Updates πŸŽ‰πŸŽ‰

  • [2025-12-30] Released Yuan 3.0-40B Multimodal Large Language Model, a high-performance model for enterprise-grade application scenarios: Yuan3.0 Flash

1. Introduction

Yuan 3.0 Flash, developed by the YuanLab.ai team, is a 40B-parameter multimodal foundation model built on a Mixture of Experts (MoE) architecture that activates only about 3.7B parameters per inference. Through an innovative reinforcement learning training method (RAPO), it significantly reduces inference token consumption while improving reasoning accuracy, pursuing a "less computation, higher intelligence" path for large language models. We have also released the Yuan3.0 technical report, which provides more detailed technical information and evaluation results.

Fig.1: Yuan3.0 Multimodal Large Language Model Architecture
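
The exact architecture is specified in the technical report; purely as an illustration of how an MoE layer activates only a small subset of its expert parameters for each token, here is a generic top-k routing sketch in PyTorch. The hidden sizes, expert count, and top-k value below are hypothetical and are not the real Yuan 3.0 Flash configuration.

```python
# Illustrative only: generic top-k MoE routing, NOT the actual Yuan 3.0 Flash layer.
# All dimensions and the expert count / top-k below are made-up placeholder values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=32, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)           # routing probabilities per token
        weights, idx = gate.topk(self.top_k, dim=-1)       # each token selects top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique():                   # only the selected experts ever run
                mask = idx[:, k] == e
                out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out

# With 32 experts and top-2 routing, roughly 2/32 of the expert parameters are active per token.
moe = TopKMoE()
y = moe(torch.randn(8, 1024))
```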

Core Features

  • πŸš€ Efficient Inference: Reduces inference token consumption by up to 75%, significantly lowering costs
  • 🎯 Enterprise-Grade Optimization: Deeply optimized for enterprise scenarios such as RAG, document understanding, and table analysis
  • 🎨 Multimodal Support: Supports text, image, table, document and other multimodal inputs
  • πŸ“š Long Context: Supports 128K context length, achieving 100% accuracy in "Needle in a Haystack" tests
  • ⚑ Ready-to-Use Intelligence: Default inference mode meets the needs of most enterprise scenarios

2. Performance

Yuan 3.0 Flash outperforms GPT-5.1 in enterprise-grade RAG, multimodal retrieval, table understanding, summary generation and other tasks. With 40B parameters, it achieves the reasoning accuracy of 235B/671B models while reducing token consumption by 50%-75%, providing enterprises with high-performance, low-cost large language model solutions.


3. Core Technology

RAPO Reinforcement Learning Algorithm

The Reflection-aware Adaptive Policy Optimization (RAPO) algorithm uses a Reflection Inhibition Reward Mechanism (RIRM) to:

  • βœ… Identifies the key point where the correct answer is first obtained
  • 🎯 Suppresses subsequent redundant reasoning behavior
  • πŸ“‰ Improves accuracy while reducing inference token count by approximately 75%

| Training Method | AIME 2024 Accuracy | AIME 2024 Avg Output Length | MATH-500 Accuracy | MATH-500 Avg Output Length |
|---|---|---|---|---|
| Yuan3.0 Flash (40B) SFT | 31.45% | 13,656 tokens | 83.20% | 3,362 tokens |
| RL + DAPO length-penalty | 46.35% | 13,781 tokens | 89.06% | 3,974 tokens |
| RL + RIRM | 47.92% | 7,505 tokens | 89.47% | 1,777 tokens |
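
RAPO and RIRM are described in full in the technical report; the sketch below only illustrates the general shape of a reflection-inhibition reward: locate the earliest prefix of a rollout that already contains the correct answer, then penalize every token generated after that point. The `is_correct_prefix` checker and the penalty weight are hypothetical placeholders, not the actual reward used to produce the numbers above.

```python
# Hypothetical sketch of a reflection-inhibition style reward; not the real RAPO/RIRM code.
def rirm_style_reward(token_strs, reference_answer, is_correct_prefix, redundancy_penalty=0.001):
    """Correctness bonus minus a penalty on tokens emitted after the first prefix
    that already contains the correct answer (i.e. redundant "reflection")."""
    first_correct = None
    text = ""
    for i, tok in enumerate(token_strs):
        text += tok
        if first_correct is None and is_correct_prefix(text, reference_answer):
            first_correct = i + 1                       # earliest point the answer is reached
    if first_correct is None:
        return 0.0                                      # wrong rollout: no bonus, no shaping
    redundant_tokens = len(token_strs) - first_correct  # everything after that point is redundant
    return 1.0 - redundancy_penalty * redundant_tokens
```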

4. Model Download

We provide download links for multiple model formats:

| Model | Parameters | Precision | Sequence Length | Model Format | Download Links |
|---|---|---|---|---|---|
| Yuan3.0 Flash | 40B | 16bit | 128K | HuggingFace | ModelScope \| HuggingFace \| WiseModel |
| Yuan3.0 Flash 4bit | 40B | 4bit | 128K | HuggingFace | ModelScope \| HuggingFace \| WiseModel |
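
A minimal, unverified loading sketch with the transformers library follows; the repository id, the chat template, and whether trust_remote_code is required are assumptions on our part, so please consult the linked GitHub/ModelScope pages for the officially supported inference path.

```python
# Minimal sketch, not an official quickstart. The repo id below is assumed;
# check the GitHub / ModelScope / HuggingFace pages for the actual identifier.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "YuanLab/Yuan3.0-Flash"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Summarize the key figures in the attached quarterly report."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```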

5. Evaluation Results

5.1 Text-based RAG Evaluation: ChatRAG πŸ†

Yuan 3.0 Flash leads DeepSeek-V3, DeepSeek-R1 and other large language models in average accuracy across 10 evaluation tasks in the industry-standard RAG benchmark ChatRAG.

Model Average Accuracy Comparison

| Models | Avg All | D2D | QuAC | QReCC | CoQA | DoQA | CFQA | SQA | TCQA | HDial | INSCIT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-V3 | 50.47 | 31.59 | 28.86 | 49.31 | 76.98 | 26.11 | 83.49 | 82.13 | 46.69 | 47.43 | 32.08 |
| DeepSeek-V3.2 | 49.67 | 34.30 | 28.09 | 49.97 | 77.29 | 29.46 | 72.85 | 79.48 | 44.64 | 47.99 | 32.64 |
| OpenAI GPT-4o | 50.54 | 32.76 | 26.56 | 49.30 | 76.11 | 28.78 | 81.85 | 81.14 | 49.75 | 41.29 | 26.69 |
| OpenAI GPT-o3 | 44.06 | 23.05 | 20.82 | 40.42 | 69.42 | 18.56 | 67.75 | 86.71 | 45.85 | 41.29 | 26.69 |
| DeepSeek-R1 | 39.42 | 21.46 | 22.23 | 42.41 | 62.53 | 24.68 | 81.48 | 82.06 | 30.74 | 37.97 | 28.68 |
| OpenAI GPT-5.1 | 46.10 | 28.24 | 23.16 | 45.43 | 68.84 | 20.88 | 73.05 | 81.32 | 44.70 | 45.39 | 29.95 |
| Yuan3.0 Flash | 64.47 | 49.82 | 53.79 | 57.08 | 90.93 | 59.99 | 74.40 | 87.52 | 66.31 | 68.45 | 36.40 |

β€’ Long Context Tests (D2D, QuAC, QReCC)
β€’ Wikipedia Retrieval Tests (TCQA, INSCIT)
β€’ Short Text & Structured Context Tests (CoQA, DoQA, CFQA, SQA, HDial)


5.2 Multimodal RAG Evaluation: Docmatix πŸ†

Yuan3.0 Flash leads Claude3.5, OpenAI GPT-4o, o3 and other models in the multimodal RAG benchmark Docmatix, achieving the highest average accuracy among the models evaluated, including GPT-5.1.

Model Average Accuracy Comparison

| Models | Avg. |
|---|---|
| Qwen2.5-VL-72B-Instruct | 59.75 |
| InternVL3-78B | 42.99 |
| Claude3.5-Sonnet | 42.55 |
| OpenAI GPT-4o | 56.79 |
| OpenAI GPT-o3 | 45.57 |
| OpenAI GPT-4V | 60.10 |
| OpenAI GPT-5.1 | 48.52 |
| Yuan3.0 Flash | 65.07 |

Docmatix evaluates a model's ability to retrieve and correlate information across text, tables, images, and other multimodal content in multi-page complex documents, and to answer questions about it accurately.


5.3 Multimodal Complex Table Content Analysis Evaluation: MMTab πŸ†

Multimodal table understanding is an important application scenario in enterprise office automation. Yuan3.0 Flash achieves leading average accuracy on 15 evaluation tasks in the industry-standard multimodal complex table understanding benchmark MMTab, surpassing OpenAI's GPT-5.1.

Model Average Accuracy Comparison

| Models | Avg. | TABMWP | WTQ | WTQ | HiTab | TAT-QA | FeTaQA | TabFact | InfoTabs | HiTab_T2T | Rotowire | WikiBIO | TSD_Row | TSD_Col | TCE | TCL | MCD | RCE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zhipu GLM-4.5V | 52.00 | 88.21 | 77.42 | 51.52 | 62.69 | 5.25 | 89.44 | 79.48 | 5.17 | 4.48 | 2.69 | 47.40 | 89.70 | 52.74 | 50.84 | 43.47 | 50.77 | 82.79 |
| OpenAI GPT-4V | 29.90 | 60.50 | 48.00 | 27.50 | 32.50 | 11.04 | 45.50 | 65.60 | 2.98 | 4.23 | 1.94 | 19.00 | 38.00 | 14.36 | 27.91 | 3.50 | 48.52 | 57.14 |
| OpenAI GPT-5.1 | 55.15 | 64.95 | 60.77 | 77.77 | 61.37 | 8.70 | 52.81 | 64.30 | 44.16 | 17.81 | 11.95 | 96.60 | 62.10 | 86.43 | 44.66 | 72.46 | 53.58 | 57.20 |
| Yuan3.0 Flash | 58.29 | 95.09 | 68.23 | 69.80 | 69.17 | 28.42 | 87.32 | 83.50 | 13.30 | 14.74 | 17.26 | 46.60 | 82.80 | 56.77 | 56.98 | 65.20 | 62.07 | 73.67 |

5.4 Text Summarization Generation Evaluation: SummEval πŸ†

Summarization generation is a core requirement for historical information compression in intelligent agent applications. Yuan 3.0 achieves leading average accuracy in the industry-standard summarization generation benchmark SummEval across three major capabilities: lexical overlap, semantic similarity, and factual consistency, surpassing the DeepSeek-V3 large language model.

Model Average Accuracy Comparison

| Models | Avg. | ROUGE-1 (Lexical Overlap) | ROUGE-2 (Lexical Overlap) | BERTScore (Semantic Similarity) | SummaC (Factual Consistency) |
|---|---|---|---|---|---|
| DeepSeek-V3 | 59.28 | 25.50 | 9.20 | 86.30 | 68.20 |
| DeepSeek-V3.2 | 51.36 | 33.30 | 11.92 | 85.61 | 41.76 |
| Gemini-2.0-Flash | 45.35 | 24.80 | 8.70 | 85.70 | 29.50 |
| Claude-3.5-Sonnet | 45.43 | 24.10 | 8.30 | 85.20 | 30.70 |
| OpenAI GPT-4o | 46.53 | 25.00 | 8.90 | 85.90 | 32.50 |
| OpenAI GPT-5.1 | 49.44 | 27.48 | 10.16 | 84.63 | 40.50 |
| Yuan3.0 Flash | 59.31 | 51.32 | 28.32 | 89.99 | 45.34 |
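
The ROUGE and BERTScore columns correspond to standard public metrics; as a rough illustration of how such scores can be computed with the open-source rouge-score and bert-score packages (this is not the exact evaluation harness behind the numbers above):

```python
# Illustrative metric computation with public packages (pip install rouge-score bert-score);
# not the exact SummEval harness used for the table above.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The central bank raised interest rates by 25 basis points to curb inflation."
candidate = "Interest rates were raised 25 bps by the central bank to fight inflation."

# Lexical overlap: ROUGE-1 / ROUGE-2 F-measure between candidate summary and reference
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}  ROUGE-2 F1: {rouge['rouge2'].fmeasure:.3f}")

# Semantic similarity: BERTScore F1 over the same pair
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```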