Diffusers documentation
WanAnimateTransformer3DModel
A Diffusion Transformer model for 3D video-like data was introduced in Wan Animate by the Alibaba Wan Team.
The model can be loaded with the following code snippet.
import torch
from diffusers import WanAnimateTransformer3DModel

transformer = WanAnimateTransformer3DModel.from_pretrained(
    "Wan-AI/Wan2.2-Animate-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16
)

WanAnimateTransformer3DModel
class diffusers.WanAnimateTransformer3DModel
< source >( patch_size: typing.Tuple[int] = (1, 2, 2) num_attention_heads: int = 40 attention_head_dim: int = 128 in_channels: typing.Optional[int] = 36 latent_channels: typing.Optional[int] = 16 out_channels: typing.Optional[int] = 16 text_dim: int = 4096 freq_dim: int = 256 ffn_dim: int = 13824 num_layers: int = 40 cross_attn_norm: bool = True qk_norm: typing.Optional[str] = 'rms_norm_across_heads' eps: float = 1e-06 image_dim: typing.Optional[int] = 1280 added_kv_proj_dim: typing.Optional[int] = None rope_max_seq_len: int = 1024 pos_embed_seq_len: typing.Optional[int] = None motion_encoder_channel_sizes: typing.Optional[typing.Dict[str, int]] = None motion_encoder_size: int = 512 motion_style_dim: int = 512 motion_dim: int = 20 motion_encoder_dim: int = 512 face_encoder_hidden_dim: int = 1024 face_encoder_num_heads: int = 4 inject_face_latents_blocks: int = 5 motion_encoder_batch_size: int = 8 )
Parameters
- patch_size (Tuple[int], defaults to (1, 2, 2)) — 3D patch dimensions for video embedding (t_patch, h_patch, w_patch).
- num_attention_heads (int, defaults to 40) — The number of attention heads.
- attention_head_dim (int, defaults to 128) — The number of channels in each attention head.
- in_channels (int, defaults to 36) — The number of channels in the input.
- out_channels (int, defaults to 16) — The number of channels in the output.
- text_dim (int, defaults to 4096) — Input dimension for text embeddings.
- freq_dim (int, defaults to 256) — Dimension for sinusoidal time embeddings.
- ffn_dim (int, defaults to 13824) — Intermediate dimension in the feed-forward network.
- num_layers (int, defaults to 40) — The number of transformer blocks to use.
- cross_attn_norm (bool, defaults to True) — Enable cross-attention normalization.
- qk_norm (str, optional, defaults to "rms_norm_across_heads") — The type of query/key normalization to use.
- eps (float, defaults to 1e-6) — Epsilon value for normalization layers.
- image_dim (int, optional, defaults to 1280) — The number of channels to use for the image embedding. If None, no projection is used.
- added_kv_proj_dim (int, optional, defaults to None) — The number of channels to use for the added key and value projections. If None, no projection is used.
A Transformer model for video-like data used in Wan Animate.
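The constructor arguments above are recorded on the model's config once a checkpoint is loaded, which is a convenient way to verify them. A minimal sketch, assuming the loading snippet above has already been run; the values in the comments are the defaults from the signature, which the released checkpoint is assumed to follow:

# Inspect the configuration of a loaded checkpoint (assumes `transformer` from the snippet above).
print(transformer.config.num_layers)           # 40
print(transformer.config.num_attention_heads)  # 40
print(transformer.config.attention_head_dim)   # 128

# The transformer's hidden size is num_attention_heads * attention_head_dim.
print(transformer.config.num_attention_heads * transformer.config.attention_head_dim)  # 5120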
forward
< source >( hidden_states: Tensor timestep: LongTensor encoder_hidden_states: Tensor encoder_hidden_states_image: typing.Optional[torch.Tensor] = None pose_hidden_states: typing.Optional[torch.Tensor] = None face_pixel_values: typing.Optional[torch.Tensor] = None motion_encode_batch_size: typing.Optional[int] = None return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None )
Parameters
- hidden_states (torch.Tensor of shape (B, 2C + 4, T + 1, H, W)) — Input noisy video latents, where B is the batch size, C is the number of latent channels (16 for the Wan VAE), T is the number of latent frames in an inference segment, H is the latent height, and W is the latent width.
- timestep (torch.LongTensor) — The current timestep in the denoising loop.
- encoder_hidden_states (torch.Tensor) — Text embeddings from the text encoder (umT5 for Wan Animate).
- encoder_hidden_states_image (torch.Tensor) — CLIP visual features of the reference (character) image.
- pose_hidden_states (torch.Tensor of shape (B, C, T, H, W)) — Latents of the pose video that drive the character's body motion.
- face_pixel_values (torch.Tensor of shape (B, C', S, H', W')) — Face video in pixel space (not latent space). Typically C' = 3, and H' and W' are the height and width of the face video in pixels. Here S is the inference segment length, usually set to 77.
- motion_encode_batch_size (int, optional) — The batch size for batched encoding of the face video via the motion encoder. Defaults to self.config.motion_encoder_batch_size if not set.
- return_dict (bool, optional, defaults to True) — Whether to return the output as a dict or a tuple.
Forward pass of the Wan2.2-Animate transformer model.
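The shape conventions documented above can be made concrete with dummy inputs. The following is a shape sketch only, not a recipe from this page: the latent sizes, the text length (512 tokens), the CLIP token count (257), the face resolution (512) and the timestep shape are all assumptions inferred from the parameter descriptions and the config defaults.

import torch

B, C, T = 1, 16, 20      # batch size, latent channels (Wan VAE), latent frames per segment (assumed)
H, W = 30, 52            # latent height/width (assumed)
S = 77                   # pixel-space segment length for the face video

device, dtype = transformer.device, transformer.dtype

hidden_states = torch.randn(B, 2 * C + 4, T + 1, H, W, device=device, dtype=dtype)
timestep = torch.tensor([500], device=device, dtype=torch.long)                       # assumed shape (B,)
encoder_hidden_states = torch.randn(B, 512, 4096, device=device, dtype=dtype)         # umT5 text embeddings (512 tokens assumed)
encoder_hidden_states_image = torch.randn(B, 257, 1280, device=device, dtype=dtype)   # CLIP features of the reference image (257 tokens assumed)
pose_hidden_states = torch.randn(B, C, T, H, W, device=device, dtype=dtype)           # pose video latents
face_pixel_values = torch.randn(B, 3, S, 512, 512, device=device, dtype=dtype)        # face video in pixel space (512x512 assumed)

with torch.no_grad():
    out = transformer(
        hidden_states=hidden_states,
        timestep=timestep,
        encoder_hidden_states=encoder_hidden_states,
        encoder_hidden_states_image=encoder_hidden_states_image,
        pose_hidden_states=pose_hidden_states,
        face_pixel_values=face_pixel_values,
    )

print(out.sample.shape)   # expected (B, out_channels, T + 1, H, W)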
Transformer2DModelOutput
class diffusers.models.modeling_outputs.Transformer2DModelOutput
< source >( sample: torch.Tensor )
Parameters
- sample (torch.Tensor of shape (batch_size, num_channels, height, width) or (batch_size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete) — The hidden states output conditioned on the encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.
The output of Transformer2DModel.
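Continuing the shape sketch above, the two return conventions of forward() look like this. This assumes the dummy inputs from the forward() example; return_dict=False returning a plain tuple whose first element is the sample tensor is the standard diffusers convention.

inputs = dict(
    hidden_states=hidden_states,
    timestep=timestep,
    encoder_hidden_states=encoder_hidden_states,
    encoder_hidden_states_image=encoder_hidden_states_image,
    pose_hidden_states=pose_hidden_states,
    face_pixel_values=face_pixel_values,
)

with torch.no_grad():
    out = transformer(**inputs)                            # Transformer2DModelOutput with a .sample field
    (sample,) = transformer(**inputs, return_dict=False)   # plain tuple; first element is the sample tensor

print(out.sample.shape, sample.shape)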