Xiangtai Li

LXT

https://lxtgh.github.io/

AI & ML interests

Computer Vision, Multi-Modal Understanding, Generative AI

Recent Activity

upvoted a paper about 8 hours ago

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

upvoted a paper about 8 hours ago

STEP3-VL-10B Technical Report

upvoted a paper 3 days ago

BabyVision: Visual Reasoning Beyond Language

upvoted 2 papers about 8 hours ago

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Paper • 2601.10611 • Published 1 day ago • 15

STEP3-VL-10B Technical Report

Paper • 2601.09668 • Published 2 days ago • 129

upvoted 2 papers 3 days ago

BabyVision: Visual Reasoning Beyond Language

Paper • 2601.06521 • Published 6 days ago • 179

Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

Paper • 2601.06943 • Published 5 days ago • 201

upvoted 2 papers 29 days ago

Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

Paper • 2512.16760 • Published 29 days ago • 13

LLaDA2.0: Scaling Up Diffusion Language Models to 100B

Paper • 2512.15745 • Published Dec 10, 2025 • 78

upvoted a paper 30 days ago

RecTok: Reconstruction Distillation along Rectified Flow

Paper • 2512.13421 • Published Dec 15, 2025 • 4

upvoted a paper about 1 month ago

Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation

Paper • 2512.02457 • Published Dec 2, 2025 • 13

upvoted 2 papers about 2 months ago

Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark

Paper • 2511.13853 • Published Nov 17, 2025 • 34

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

Paper • 2511.09611 • Published Nov 12, 2025 • 69

upvoted a paper 2 months ago

Visual Spatial Tuning

Paper • 2511.05491 • Published Nov 7, 2025 • 51

upvoted 6 papers 3 months ago

PairUni: Pairwise Training for Unified Multimodal Language Models

Paper • 2510.25682 • Published Oct 29, 2025 • 13

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Paper • 2510.26802 • Published Oct 30, 2025 • 33

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

Paper • 2510.18876 • Published Oct 21, 2025 • 36

upvoted a collection 3 months ago

Sa2VA Model Zoo

Collection

Hugging Face Model Zoo for Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos, by Bytedance Seed CV Research • 12 items • Updated Nov 27, 2025 • 44

upvoted 2 papers 3 months ago

Detect Anything via Next Point Prediction

Paper • 2510.12798 • Published Oct 14, 2025 • 46

Diffusion Transformers with Representation Autoencoders

Paper • 2510.11690 • Published Oct 13, 2025 • 165
