arxiv:2510.27492

ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

Published on Oct 30
Submitted by taesiri on Nov 3
#2 Paper of the day

Abstract

Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
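
For intuition, the alternation the abstract describes can be pictured as a simple text/image loop. The sketch below is illustrative only: the `generate_text_thought`, `generate_image_thought`, and `is_final_answer` methods are hypothetical names, not the released ThinkMorph interface.

```python
# A minimal sketch of interleaved text-image chain-of-thought, assuming a
# hypothetical unified-model interface; NOT the released ThinkMorph API.
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class ReasoningTrace:
    steps: List[Any] = field(default_factory=list)  # alternating text and image thoughts
    answer: str = ""


def interleaved_cot(model: Any, question: str, image: Any, max_steps: int = 6) -> ReasoningTrace:
    """Alternate verbal and visual thoughts until the model emits a final answer."""
    trace = ReasoningTrace()
    context: List[Any] = [question, image]
    for _ in range(max_steps):
        # A text thought advances the verbal logic.
        text_thought = model.generate_text_thought(context)
        trace.steps.append(text_thought)
        context.append(text_thought)
        if model.is_final_answer(text_thought):
            trace.answer = text_thought
            break
        # An image thought concretely manipulates visual content
        # (e.g. drawing a box, overlaying a route, rearranging patches).
        image_thought = model.generate_image_thought(context)
        trace.steps.append(image_thought)
        context.append(image_thought)
    return trace
```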

Community

ThinkMorph Multimodal Reasoning on Vision-centric Tasks:
[Figure: thinkmorph_main]

The four tasks we interleave:

  • 🧩 Jigsaw Assembly – rearrange patches with visual verification
  • 🗺️ Spatial Navigation – overlay and validate routes
  • 🔍 Visual Search – draw precise boxes to ground answers
  • 📊 Chart Refocus – highlight regions, then compute

Results: +86.67% on Spatial Navigation, +38.75% on Jigsaw Assembly, and +34.74% on average over the base model.
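
One plausible way to serialize a single interleaved training trace for these tasks is shown below; the field names, file names, and contents are made up for illustration, since the page does not specify the actual schema of the 24K traces.

```python
# Hypothetical layout of one interleaved fine-tuning example (visual search);
# all field names, file names, and contents are invented for illustration.
example = {
    "task": "visual_search",  # or: jigsaw_assembly, spatial_navigation, chart_refocus
    "question": "Which shelf holds the red mug?",
    "input_image": "scene_001.png",
    "steps": [
        {"type": "text", "content": "The mug is probably near the sink; crop and inspect that region."},
        {"type": "image", "content": "scene_001_box.png"},  # image thought: a drawn bounding box
        {"type": "text", "content": "The box confirms the mug on the second shelf, left of the sink."},
    ],
    "answer": "The second shelf, left of the sink.",
}
```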

Representative Emergent Properties in Interleaved Reasoning:
[Figure: emrging_prop]
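
One reading of "better test-time scaling through diversified multimodal thoughts" is parallel sampling of several interleaved traces followed by answer aggregation. The sketch below assumes a `reason_once` callable (for example, a wrapper around the interleaved loop sketched earlier) and plain majority voting, which may differ from the paper's actual procedure.

```python
# Illustrative test-time scaling over diverse interleaved traces; the
# aggregation scheme (majority vote) is an assumption, not the paper's method.
from collections import Counter
from typing import Any, Callable


def scale_at_test_time(
    reason_once: Callable[[str, Any], str],  # runs one stochastic interleaved trace, returns an answer
    question: str,
    image: Any,
    n_samples: int = 8,
) -> str:
    """Sample several diverse interleaved traces and return the majority answer."""
    answers = [reason_once(question, image) for _ in range(n_samples)]
    answers = [a for a in answers if a]  # drop traces that produced no final answer
    return Counter(answers).most_common(1)[0][0] if answers else ""
```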

ThinkMorph generalizes to out-of-domain benchmarks:
[Figure: main_result]
