Flux2Transformer2DModel
A Transformer model for image-like data from Flux2.
Flux2Transformer2DModel
class diffusers.Flux2Transformer2DModel
( patch_size: int = 1 in_channels: int = 128 out_channels: typing.Optional[int] = None num_layers: int = 8 num_single_layers: int = 48 attention_head_dim: int = 128 num_attention_heads: int = 48 joint_attention_dim: int = 15360 timestep_guidance_channels: int = 256 mlp_ratio: float = 3.0 axes_dims_rope: typing.Tuple[int, ...] = (32, 32, 32, 32) rope_theta: int = 2000 eps: float = 1e-06 )
Parameters

- patch_size (int, defaults to 1) — Patch size to turn the input data into small patches.
- in_channels (int, defaults to 128) — The number of channels in the input.
- out_channels (int, optional, defaults to None) — The number of channels in the output. If not specified, it defaults to in_channels.
- num_layers (int, defaults to 8) — The number of layers of dual stream DiT blocks to use.
- num_single_layers (int, defaults to 48) — The number of layers of single stream DiT blocks to use.
- attention_head_dim (int, defaults to 128) — The number of dimensions to use for each attention head.
- num_attention_heads (int, defaults to 48) — The number of attention heads to use.
- joint_attention_dim (int, defaults to 15360) — The number of dimensions to use for the joint attention (embedding/channel dimension of encoder_hidden_states).
- pooled_projection_dim (int, defaults to 768) — The number of dimensions to use for the pooled projection.
- guidance_embeds (bool, defaults to True) — Whether to use guidance embeddings for the guidance-distilled variant of the model.
- axes_dims_rope (Tuple[int], defaults to (32, 32, 32, 32)) — The dimensions to use for the rotary positional embeddings.
The Transformer model introduced in Flux 2.
Reference: https://blackforestlabs.ai/announcing-black-forest-labs/
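The snippet below is a minimal sketch of loading the transformer on its own with the standard ModelMixin loading API; the checkpoint id black-forest-labs/FLUX.2-dev is an assumption and should be replaced with the Flux 2 repository you actually use.

```python
import torch

from diffusers import Flux2Transformer2DModel

# Hypothetical repository id for a Flux 2 checkpoint; substitute your own.
repo_id = "black-forest-labs/FLUX.2-dev"

# Load only the transformer weights from the repository's "transformer"
# subfolder, in bfloat16 to keep memory usage down.
transformer = Flux2Transformer2DModel.from_pretrained(
    repo_id, subfolder="transformer", torch_dtype=torch.bfloat16
)

print(transformer.config.num_attention_heads)  # 48 for the default configuration
```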
forward
( hidden_states: Tensor encoder_hidden_states: Tensor = None timestep: LongTensor = None img_ids: Tensor = None txt_ids: Tensor = None guidance: Tensor = None joint_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None return_dict: bool = True )
Parameters

- hidden_states (torch.Tensor of shape (batch_size, image_sequence_length, in_channels)) — Input hidden_states.
- encoder_hidden_states (torch.Tensor of shape (batch_size, text_sequence_length, joint_attention_dim)) — Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
- timestep (torch.LongTensor) — Used to indicate the denoising step.
- block_controlnet_hidden_states (list of torch.Tensor) — A list of tensors that, if specified, are added to the residuals of transformer blocks.
- joint_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
- return_dict (bool, optional, defaults to True) — Whether or not to return a ~models.transformer_2d.Transformer2DModelOutput instead of a plain tuple.
The Flux2Transformer2DModel forward method.
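A minimal sketch of a forward call with dummy inputs, assuming the transformer loaded above and that img_ids and txt_ids carry one positional coordinate per RoPE axis (four here, matching axes_dims_rope = (32, 32, 32, 32)); the sequence lengths, timestep value, and guidance value are purely illustrative.

```python
import torch

# Illustrative sequence lengths; real values depend on latent resolution and prompt length.
batch_size, image_seq_len, text_seq_len = 1, 1024, 512

hidden_states = torch.randn(batch_size, image_seq_len, 128, dtype=torch.bfloat16)
encoder_hidden_states = torch.randn(batch_size, text_seq_len, 15360, dtype=torch.bfloat16)

# Positional ids: one coordinate per RoPE axis (assumed layout, four axes here).
img_ids = torch.zeros(image_seq_len, 4)
txt_ids = torch.zeros(text_seq_len, 4)

timestep = torch.tensor([1.0])  # illustrative denoising step
guidance = torch.tensor([4.0])  # guidance scale for the guidance-distilled model

with torch.no_grad():
    sample = transformer(
        hidden_states=hidden_states,
        encoder_hidden_states=encoder_hidden_states,
        timestep=timestep,
        img_ids=img_ids,
        txt_ids=txt_ids,
        guidance=guidance,
        return_dict=False,
    )[0]

print(sample.shape)  # (batch_size, image_sequence_length, out_channels)
```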