When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios Paper • 2507.20198 • Published Jul 27 • 26
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers Paper • 2505.21497 • Published May 27 • 108
HoliTom: Holistic Token Merging for Fast Video Large Language Models Paper • 2505.21334 • Published May 27 • 21
SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning Paper • 2505.12448 • Published May 18 • 10
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining Paper • 2505.07608 • Published May 12 • 82
PiTe: Pixel-Temporal Alignment for Large Video-Language Model Paper • 2409.07239 • Published Sep 11, 2024 • 15
🪐 SmolLM Collection A series of smol LLMs: 135M, 360M and 1.7B. We release base and Instruct models as well as the training corpus and some WebGPU demos • 12 items • Updated May 5 • 237