Viktor Cerny
Nazzaroth2
AI & ML interests
Machine Translation (Focus English-Japanese)

Organizations
None yet

Collections

data synthesis

OCR
- Gemma 3 Technical Report (Paper • 2503.19786 • Published • 54)
- Kimi-VL Technical Report (Paper • 2504.07491 • Published • 132)
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models (Paper • 2504.10479 • Published • 300)
- FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding (Paper • 2504.09925 • Published • 38)

VLM RL Reasoning
- OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement (Paper • 2503.17352 • Published • 24)
- When Less is Enough: Adaptive Token Reduction for Efficient Image Representation (Paper • 2503.16660 • Published • 72)
- CoMP: Continual Multimodal Pre-training for Vision Foundation Models (Paper • 2503.18931 • Published • 30)
- MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding (Paper • 2503.13964 • Published • 20)

llm_compression

Loras

t2i consistency works

small_or_multimodal_llm

long_context

models to test out

RL_Papers in general
- Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning (Paper • 2504.08672 • Published • 55)
- A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis (Paper • 2504.12322 • Published • 28)
- Learning to Reason under Off-Policy Guidance (Paper • 2504.14945 • Published • 88)
- TTRL: Test-Time Reinforcement Learning (Paper • 2504.16084 • Published • 120)

imageGen
- Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models (Paper • 2503.18446 • Published • 12)
- Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models (Paper • 2503.20240 • Published • 22)
- BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation (Paper • 2503.20672 • Published • 14)
- Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models (Paper • 2503.20198 • Published • 4)

LLM-External_information
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Paper • 2310.11511 • Published • 78)
- Improving Text Embeddings with Large Language Models (Paper • 2401.00368 • Published • 82)
- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders (Paper • 2404.05961 • Published • 66)

LLM_Reasoning-ErrorCorrection

3D (nerfs, gaussians, generation etc.)

videogames_roleplay

manga_translation
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss (Paper • 2402.05008 • Published • 23)
- PALO: A Polyglot Large Multimodal Model for 5B People (Paper • 2402.14818 • Published • 24)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (Paper • 2403.09611 • Published • 129)
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (Paper • 2404.06512 • Published • 30)

model training

Reward Modeling