In this work, we present AMD Hummingbird-XT, an efficient 5B-parameter DiT-based video generation model designed for high-quality video generation on client-grade GPUs.

Hummingbird-XT is built on Wan2.2-5B-TI2V and trained with DMD step distillation on carefully curated data, enabling 3-step generation while preserving high visual fidelity and motion quality. To reduce the computational overhead of high-resolution video decoding in 3D convolution-based VAE decoders, we introduce a lightweight VAE decoder that replaces part of the 3D convolutions with depthwise separable convolutions. Additionally, to extend the length of generated videos, we introduce Hummingbird-XTX, an efficient autoregressive model for long-video generation based on Wan-2.1-1.3B.
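To give a rough sense of why swapping a standard 3D convolution for a depthwise separable one lightens the decoder, the sketch below compares weight counts for a single layer. The channel width (256) and kernel size (3×3×3) are illustrative assumptions, not values taken from the Hummingbird-XT decoder, and the helper function names are hypothetical.

```python
def conv3d_params(c_in, c_out, kernel=(3, 3, 3), bias=True):
    """Weights in a standard 3D convolution: every output channel
    mixes every input channel across the full kt x kh x kw window."""
    kt, kh, kw = kernel
    return c_in * c_out * kt * kh * kw + (c_out if bias else 0)


def depthwise_separable_conv3d_params(c_in, c_out, kernel=(3, 3, 3), bias=True):
    """Weights in a depthwise separable 3D convolution:
    a per-channel spatial-temporal filter (depthwise) followed by
    a 1x1x1 convolution that mixes channels (pointwise)."""
    kt, kh, kw = kernel
    depthwise = c_in * kt * kh * kw + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Assumed layer shape, chosen only for illustration.
standard = conv3d_params(256, 256)
separable = depthwise_separable_conv3d_params(256, 256)
print(f"standard:  {standard:,} weights")
print(f"separable: {separable:,} weights")
print(f"reduction: {standard / separable:.1f}x")
```

Ignoring bias terms, the multiply-accumulate count per output position shrinks by the same factor as the weights. How much of that saving shows up in end-to-end decoding time depends on how many layers are replaced and on the hardware.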

As a result, Hummingbird-XT achieves a 33× speedup on Strix Halo iGPU and a 40× speedup on AMD Instinct™ MI325, and supports generating 121-frame videos at 720×1280 resolution across both server-grade (AMD Instinct™ MI300 and AMD Instinct™ MI325) and client-grade (Strix Halo and Navi48) devices. Quantitative results on the VBench-T2V and VBench-I2V benchmarks show that Hummingbird-XT achieves competitive performance compared to the original Wan2.2-5B-TI2V model.

The training and inference code is fully released on Hummingbird-XT, and the technical details are described in Bridging the Last Mile: Deploying Hummingbird-XT for Efficient Video Generation on AMD Consumer-Grade Platforms.

Caption	Video
Animated scene features a close-up of a short fluffy monster kneeling beside a melting red candle. The art style is 3D and realistic, with a focus on lighting and texture. The mood of the painting is one of wonder and curiosity, as the monster gazes at the flame with wide eyes and open mouth. Its pose and expression convey a sense of innocence and playfulness, as if it is exploring the world around it for the first time. The use of warm colors and dramatic lighting further enhances the cozy atmosphere of the image.
A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
The young East Asian man with short black hair, fair skin, and monolid eyes looks ahead. A young East Asian woman with long black hair and fair skin turns to smile warmly at him. The background is blurred, focusing on their shared gaze. Realistic cinematic style.

Hummingbird-XT Text-to-Video Showcases

Caption	Video
A back-view close-up focusing on the runner’s feet striking the track. Only subtle movement occurs: his steps land firmly, kicking up a small amount of dust or rubber granules. The camera stays low and straight-on behind him, following smoothly with minimal shake. The sunlight is bright, with long shadows stretching forward.
A graceful woman stands under a majestic sandstone arch, forming a small heart shape with her fingers close to the camera while smiling warmly and radiating joy. Behind her, a smooth and elegant fountain rises gracefully, its water reflecting the warm, inviting courtyard walls in a mirror-like fashion.
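On stage, a man plays an electric guitar made of lightning. As the music swells, sparks crackle around him. Suddenly, the dazzling light turns dark red, his eyes emit an eerie glow, and black wings unfurl from his back. His skin darkens, lightning coils around his body, and he transforms into a demon, standing amid billowing smoke and thunder.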

Hummingbird-XT Image-to-Video Showcases
