arxiv:2510.24717

Uniform Discrete Diffusion with Metric Path for Video Generation

Published on Oct 28
· Submitted by Haoge Deng on Oct 29

Abstract

URSA, a discrete generative model, bridges the gap with continuous approaches in video generation by using a Linearized Metric Path and Resolution-dependent Timestep Shifting, achieving high-resolution and long-duration synthesis with fewer inference steps.

AI-generated summary

Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within a single model, including interpolation and image-to-video generation. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at https://github.com/baaivision/URSA
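
To make the asynchronous temporal idea more concrete, here is a minimal sketch of how per-frame noise levels could express image-to-video and interpolation within one corruption routine. It is an illustration under stated assumptions, not the URSA implementation: tokens are plain integer ids, corruption is simple uniform replacement rather than the paper's Linearized Metric Path, and every name (`corrupt_tokens`, `image_to_video_schedule`, `interpolation_schedule`, the "keep probability" convention) is hypothetical.

```python
import random


def corrupt_tokens(frames, keep_prob, vocab_size, rng):
    """Corrupt each frame's tokens with uniform noise.

    frames:     list of frames, each a list of integer token ids
    keep_prob:  one probability per frame; a token is kept with this
                probability, otherwise replaced by a uniform random id
    Assumption: uniform replacement stands in for the paper's metric path.
    """
    noisy = []
    for frame, p in zip(frames, keep_prob):
        noisy.append(
            [tok if rng.random() < p else rng.randrange(vocab_size) for tok in frame]
        )
    return noisy


def image_to_video_schedule(num_frames, t):
    """First frame stays clean (the conditioning image); the rest share level t."""
    return [1.0] + [t] * (num_frames - 1)


def interpolation_schedule(num_frames, t):
    """First and last frames stay clean; the frames in between share level t.

    Assumes num_frames >= 2.
    """
    return [1.0] + [t] * (num_frames - 2) + [1.0]


if __name__ == "__main__":
    rng = random.Random(0)
    video = [[1, 2, 3, 4] for _ in range(4)]  # 4 frames, 4 tokens each
    schedule = image_to_video_schedule(len(video), t=0.3)
    print(corrupt_tokens(video, schedule, vocab_size=16, rng=rng))
```

Because the task is encoded only in the per-frame schedule, the same denoising model could in principle be trained on all of these cases at once, which is the spirit of the unification the abstract describes.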

Community

Paper submitter

We present URSA (Uniform discRete diffuSion with metric pAth), a simple yet powerful framework that bridges the gap with continuous approaches. URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens and scales efficiently to long video generation, requiring fewer inference steps. URSA enables multi-task video generation with an asynchronous timestep scheduling strategy in one unified model.

  • 🥇 Novel Approach: Uniform Discrete Diffusion with Metric Path.
  • 🥈 SOTA Performance: High efficiency with state-of-the-art T2I/T2V/I2V results.
  • 🥉 Unified Modeling: Multi-task capabilities in a single unified model.

Paper link: https://arxiv.org/abs/2510.24717
Code available at: https://github.com/baaivision/URSA
