arxiv:2503.23377

JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

Published on Mar 30
· Submitted by Hao Fei on Apr 4
Abstract

JavisDiT, a Joint Audio-Video Diffusion Transformer, generates high-quality synchronized audio-video content using a Hierarchical Spatial-Temporal Synchronized Prior Estimator, and achieves strong results on a new benchmark under a robust synchronization metric.

AI-generated summary

This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized joint audio-video generation (JAVG). Built upon the powerful Diffusion Transformer (DiT) architecture, JavisDiT generates high-quality audio and video content simultaneously from open-ended user prompts. To ensure optimal synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, guiding the synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios, and we devise a robust metric for evaluating the synchronization between generated audio-video pairs in complex real-world content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods, ensuring both high-quality generation and precise synchronization and setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at https://javisdit.github.io/.
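The paper's actual synchronization metric is defined in the full text; as a loose, hypothetical illustration of the general idea (scoring temporal alignment between paired streams), one could compare per-timestep audio and video embeddings with cosine similarity. The function and shapes below are assumptions for illustration only, not the paper's metric:

```python
import numpy as np

def sync_score(video_emb: np.ndarray, audio_emb: np.ndarray) -> float:
    """Mean per-timestep cosine similarity between temporally aligned
    video and audio embeddings, each of shape [T, D].

    Hypothetical sketch: real AV-sync metrics use learned encoders and
    more robust aggregation than a plain mean."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    # Row-wise dot product of unit vectors = cosine similarity per timestep.
    return float(np.mean(np.sum(v * a, axis=1)))

# Toy check: identical streams score 1.0; unrelated streams score near 0.
rng = np.random.default_rng(0)
emb = rng.normal(size=(16, 128))
print(sync_score(emb, emb))  # ~1.0
```

A higher score would indicate tighter moment-by-moment agreement between the two modalities; desynchronized pairs drift toward zero.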

Community

Paper author Paper submitter
•
edited Apr 8

šŸ”„šŸ”„šŸ”„ JavisDiT

🌟 We introduce JavisDiT, a novel & SoTA Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG) from open-ended user prompts.

🤠 We contribute JavisBench, a new large-scale JAVG benchmark dataset with challenging scenarios, along with robust metrics to evaluate audio-video synchronization.

šŸ“ Paper: https://arxiv.org/abs/2503.23377
šŸŽ‰ Project: https://javisdit.github.io/
✨ Code: https://github.com/JavisDiT/JavisDiT


Paper author Paper submitter
•
edited Apr 9

Our code is out: https://github.com/JavisDiT/JavisDiT
Stars and issues are welcome!


🤠 Our Gradio demo is out; feel free to try it: https://447c629bc8648ce599.gradio.live

nice work


Models citing this paper 4

Datasets citing this paper 1

Spaces citing this paper 0


Collections including this paper 3