Transformers documentation
This model was released on 2025-11-19 and added to Hugging Face Transformers on 2025-11-19.
SAM3
Overview
SAM3 (Segment Anything Model 3) was introduced in SAM 3: Segment Anything with Concepts.
SAM3 performs Promptable Concept Segmentation (PCS) on images. PCS takes text and/or image exemplars as input (e.g., “yellow school bus”), and predicts instance and semantic masks for every single object matching the concept.
The abstract from the paper is the following:
We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., “yellow school bus”), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.
This model was contributed by yonigozlan and ronghanghu.
Usage examples with 🤗 Transformers
Text-Only Prompts
>>> from transformers import Sam3Processor, Sam3Model
>>> import torch
>>> from PIL import Image
>>> import requests
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model = Sam3Model.from_pretrained("facebook/sam3").to(device)
>>> processor = Sam3Processor.from_pretrained("facebook/sam3")
>>> # Load image
>>> image_url = "http://images.cocodataset.org/val2017/000000077595.jpg"
>>> image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
>>> # Segment using text prompt
>>> inputs = processor(images=image, text="ear", return_tensors="pt").to(device)
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> # Post-process results
>>> results = processor.post_process_instance_segmentation(
... outputs,
... threshold=0.5,
... mask_threshold=0.5,
... target_sizes=inputs.get("original_sizes").tolist()
... )[0]
>>> print(f"Found {len(results['masks'])} objects")
>>> # Results contain:
>>> # - masks: Binary masks resized to original image size
>>> # - boxes: Bounding boxes in absolute pixel coordinates (xyxy format)
>>> # - scores: Confidence scores
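To visually check the detections, the masks and boxes can be overlaid on the image. This is a minimal sketch using matplotlib (not part of the SAM3 API); it assumes the `results` entries are torch tensors, as returned by the post-processing above:
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> plt.imshow(image)
>>> ax = plt.gca()
>>> for mask, box, score in zip(results["masks"], results["boxes"], results["scores"]):
...     mask_np = mask.cpu().numpy()
...     # Show each mask only where it is active, leaving the rest of the image visible
...     ax.imshow(np.ma.masked_where(mask_np == 0, mask_np), alpha=0.5)
...     x1, y1, x2, y2 = box.tolist()
...     ax.add_patch(plt.Rectangle((x1, y1), x2 - x1, y2 - y1, fill=False, color="red", linewidth=2))
...     ax.text(x1, y1, f"{float(score):.2f}", color="red")
>>> ax.axis("off")
>>> plt.savefig("sam3_text_prompt.png")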
Single Bounding Box Prompt
Segment objects using a bounding box on the visual concept:
>>> # Box in xyxy format: [x1, y1, x2, y2] in pixel coordinates
>>> # Example: laptop region
>>> box_xyxy = [100, 150, 500, 450]
>>> input_boxes = [[box_xyxy]] # [batch, num_boxes, 4]
>>> input_boxes_labels = [[1]] # 1 = positive box
>>> inputs = processor(
... images=image,
... input_boxes=input_boxes,
... input_boxes_labels=input_boxes_labels,
... return_tensors="pt"
... ).to(device)
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> # Post-process results
>>> results = processor.post_process_instance_segmentation(
... outputs,
... threshold=0.5,
... mask_threshold=0.5,
... target_sizes=inputs.get("original_sizes").tolist()
... )[0]
Multiple Box Prompts (Positive and Negative)
Use multiple boxes with positive and negative labels to refine the concept:
>>> # Load kitchen image
>>> kitchen_url = "http://images.cocodataset.org/val2017/000000136466.jpg"
>>> kitchen_image = Image.open(requests.get(kitchen_url, stream=True).raw).convert("RGB")
>>> # Define two positive boxes (e.g., dial and button on oven)
>>> # Boxes are in xyxy format [x1, y1, x2, y2] in pixel coordinates
>>> box1_xyxy = [59, 144, 76, 163] # Dial box
>>> box2_xyxy = [87, 148, 104, 159] # Button box
>>> input_boxes = [[box1_xyxy, box2_xyxy]]
>>> input_boxes_labels = [[1, 1]] # Both positive
>>> inputs = processor(
... images=kitchen_image,
... input_boxes=input_boxes,
... input_boxes_labels=input_boxes_labels,
... return_tensors="pt"
... ).to(device)
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> # Post-process results
>>> results = processor.post_process_instance_segmentation(
... outputs,
... threshold=0.5,
... mask_threshold=0.5,
... target_sizes=inputs.get("original_sizes").tolist()
... )[0]
Combined Prompts (Text + Negative Box)
Use text prompts with negative visual prompts to refine the concept:
>>> # Segment "handle" but exclude the oven handle using a negative box
>>> text = "handle"
>>> # Negative box covering oven handle area (xyxy): [40, 183, 318, 204]
>>> oven_handle_box = [40, 183, 318, 204]
>>> input_boxes = [[oven_handle_box]]
>>> inputs = processor(
... images=kitchen_image,
... text=text,
... input_boxes=input_boxes,
... input_boxes_labels=[[0]], # 0 = negative (exclude this region)
... return_tensors="pt"
... ).to(device)
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> # Post-process results
>>> results = processor.post_process_instance_segmentation(
... outputs,
... threshold=0.5,
... mask_threshold=0.5,
... target_sizes=inputs.get("original_sizes").tolist()
... )[0]
>>> # This will segment pot handles but exclude the oven handle
Batched Inference with Text Prompts
Process multiple images with different text prompts efficiently:
>>> cat_url = "http://images.cocodataset.org/val2017/000000077595.jpg"
>>> kitchen_url = "http://images.cocodataset.org/val2017/000000136466.jpg"
>>> images = [
... Image.open(requests.get(cat_url, stream=True).raw).convert("RGB"),
... Image.open(requests.get(kitchen_url, stream=True).raw).convert("RGB")
... ]
>>> # Different text prompt for each image
>>> text_prompts = ["ear", "dial"]
>>> inputs = processor(images=images, text=text_prompts, return_tensors="pt").to(device)
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> # Post-process results for both images
>>> results = processor.post_process_instance_segmentation(
... outputs,
... threshold=0.5,
... mask_threshold=0.5,
... target_sizes=inputs.get("original_sizes").tolist()
... )
>>> print(f"Image 1: {len(results[0]['masks'])} objects found")
>>> print(f"Image 2: {len(results[1]['masks'])} objects found")Batched Mixed Prompts
Use different prompt types for different images in the same batch:
>>> # Image 1: text prompt "laptop"
>>> # Image 2: visual prompt (dial box)
>>> dial_box_xyxy = [59, 144, 76, 163]
>>> inputs = processor(
... images=images,
... text=["laptop", None], # Only first image has text
... input_boxes=[None, [dial_box_xyxy]], # Only second image has box
... input_boxes_labels=[None, [1]], # Positive box for second image
... return_tensors="pt"
... ).to(device)
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> # Post-process results for both images
>>> results = processor.post_process_instance_segmentation(
... outputs,
... threshold=0.5,
... mask_threshold=0.5,
... target_sizes=inputs.get("original_sizes").tolist()
... )
>>> # Both images processed in single forward pass
Semantic Segmentation Output
SAM3 also provides semantic segmentation alongside instance masks:
>>> inputs = processor(images=image, text="ear", return_tensors="pt").to(device)
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> # Instance segmentation masks
>>> instance_masks = torch.sigmoid(outputs.pred_masks) # [batch, num_queries, H, W]
>>> # Semantic segmentation (single channel)
>>> semantic_seg = outputs.semantic_seg # [batch, 1, H, W]
>>> print(f"Instance masks: {instance_masks.shape}")
>>> print(f"Semantic segmentation: {semantic_seg.shape}")Efficient Multi-Prompt Inference on Single Image
Efficient Multi-Prompt Inference on Single Image
When running multiple text prompts on the same image, pre-compute vision embeddings to avoid redundant computation:
>>> from transformers import Sam3Processor, Sam3Model
>>> import torch
>>> from PIL import Image
>>> import requests
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model = Sam3Model.from_pretrained("facebook/sam3").to(device)
>>> processor = Sam3Processor.from_pretrained("facebook/sam3")
>>> # Load image
>>> image_url = "http://images.cocodataset.org/val2017/000000077595.jpg"
>>> image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
>>> # Pre-process image and compute vision embeddings once
>>> img_inputs = processor(images=image, return_tensors="pt").to(device)
>>> with torch.no_grad():
... vision_embeds = model.get_vision_features(pixel_values=img_inputs.pixel_values)
>>> # Run multiple text prompts efficiently
>>> text_prompts = ["ear", "eye", "nose"]
>>> all_results = []
>>> for prompt in text_prompts:
... text_inputs = processor(text=prompt, return_tensors="pt").to(device)
... with torch.no_grad():
... outputs = model(vision_embeds=vision_embeds, **text_inputs)
...
... results = processor.post_process_instance_segmentation(
... outputs,
... threshold=0.5,
... mask_threshold=0.5,
... target_sizes=img_inputs.get("original_sizes").tolist()
... )[0]
... all_results.append({"prompt": prompt, "results": results})
>>> for item in all_results:
... print(f"Prompt '{item['prompt']}': {len(item['results']['masks'])} objects found")Efficient Single-Prompt Inference on Multiple Images
When running the same text prompt on multiple images, pre-compute text embeddings to avoid redundant computation:
>>> from transformers import Sam3Processor, Sam3Model
>>> import torch
>>> from PIL import Image
>>> import requests
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model = Sam3Model.from_pretrained("facebook/sam3").to(device)
>>> processor = Sam3Processor.from_pretrained("facebook/sam3")
>>> # Pre-compute text embeddings once
>>> text_prompt = "ear"
>>> text_inputs = processor(text=text_prompt, return_tensors="pt").to(device)
>>> with torch.no_grad():
... text_embeds = model.get_text_features(**text_inputs)
>>> # Load multiple images
>>> image_urls = [
... "http://images.cocodataset.org/val2017/000000077595.jpg",
... "http://images.cocodataset.org/val2017/000000039769.jpg",
... ]
>>> images = [Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in image_urls]
>>> # Run inference on each image reusing text embeddings
>>> # Note: attention_mask must be passed along with text_embeds for proper masking
>>> all_results = []
>>> for image in images:
... img_inputs = processor(images=image, return_tensors="pt").to(device)
... with torch.no_grad():
... outputs = model(
... pixel_values=img_inputs.pixel_values,
... text_embeds=text_embeds,
... attention_mask=text_inputs.attention_mask,
... )
...
... results = processor.post_process_instance_segmentation(
... outputs,
... threshold=0.5,
... mask_threshold=0.5,
... target_sizes=img_inputs.get("original_sizes").tolist()
... )[0]
... all_results.append(results)
>>> for i, results in enumerate(all_results):
... print(f"Image {i+1}: {len(results['masks'])} '{text_prompt}' objects found")Prompt Label Conventions
SAM3 uses the following label conventions (a short sketch illustrating them follows the coordinate formats below):
For points and boxes:
- 1: Positive prompt (include this region/object)
- 0: Negative prompt (exclude this region/object)
- -10: Padding value for batched inputs
Coordinate formats:
- Input boxes: [x1, y1, x2, y2] (xyxy format) in pixel coordinates
- Output boxes (raw): [x1, y1, x2, y2] (xyxy format), normalized to [0, 1]
- Output boxes (post-processed): [x1, y1, x2, y2] (xyxy format) in absolute pixel coordinates
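A minimal sketch of the conventions in a batch with a different number of boxes per image. It reuses the two COCO images and the processor from the batched examples above; padded entries are expected to carry the point_pad_value (-10) documented for Sam3Processor below:
>>> boxes = [
...     [[59, 144, 76, 163], [87, 148, 104, 159]],  # first image: two boxes
...     [[100, 150, 500, 450]],                     # second image: one box
... ]
>>> labels = [[1, 1], [0]]  # positive, positive / negative
>>> inputs = processor(images=images, input_boxes=boxes, input_boxes_labels=labels, return_tensors="pt")
>>> print(inputs["input_boxes"].shape)   # boxes padded to the largest count in the batch
>>> print(inputs["input_boxes_labels"])  # padded positions are filled with -10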
Sam3Config
class transformers.Sam3Config
< source >( vision_config = None text_config = None geometry_encoder_config = None detr_encoder_config = None detr_decoder_config = None mask_decoder_config = None initializer_range = 0.02 **kwargs )
Parameters
- vision_config (dict or Sam3VisionConfig, optional) — Configuration for the vision encoder.
- text_config (dict or Sam3TextConfig, optional) — Configuration for the text encoder.
- geometry_encoder_config (dict or Sam3GeometryEncoderConfig, optional) — Configuration for the geometry encoder.
- detr_encoder_config (dict or Sam3DETREncoderConfig, optional) — Configuration for the DETR encoder.
- detr_decoder_config (dict or Sam3DETRDecoderConfig, optional) — Configuration for the DETR decoder.
- mask_decoder_config (dict or Sam3MaskDecoderConfig, optional) — Configuration for the mask decoder.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing weight matrices.
Configuration class to store the configuration of a Sam3Model.
Instantiating a configuration with the defaults will yield a configuration similar to that of the SAM 3 facebook/sam3 architecture.
This is the main configuration class that combines all sub-configurations for the SAM3 model.
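A minimal usage sketch following the usual Transformers configuration pattern (the model is initialized with random weights):
>>> from transformers import Sam3Config, Sam3Model
>>> # Initialize a SAM3 configuration with default values
>>> configuration = Sam3Config()
>>> # Initialize a model (with random weights) from that configuration
>>> model = Sam3Model(configuration)
>>> # Access the model configuration
>>> configuration = model.config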
Sam3ViTConfig
class transformers.Sam3ViTConfig
< source >( hidden_size = 1024 intermediate_size = 4736 num_hidden_layers = 32 num_attention_heads = 16 num_channels = 3 image_size = 1008 patch_size = 14 hidden_act = 'gelu' layer_norm_eps = 1e-06 attention_dropout = 0.0 rope_theta = 10000.0 window_size = 24 global_attn_indexes = None layer_scale_init_value = None pretrain_image_size = 336 hidden_dropout = 0.0 initializer_range = 0.02 **kwargs )
Parameters
- hidden_size (int, optional, defaults to 1024) — Dimensionality of the encoder layers.
- intermediate_size (int, optional, defaults to 4736) — Dimensionality of the feedforward (MLP) layers.
- num_hidden_layers (int, optional, defaults to 32) — Number of hidden layers in the Transformer encoder.
- num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer.
- num_channels (int, optional, defaults to 3) — Number of input image channels.
- image_size (int, optional, defaults to 1008) — Expected input image size.
- patch_size (int, optional, defaults to 14) — Size of image patches.
- hidden_act (str, optional, defaults to "gelu") — The non-linear activation function.
- layer_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by layer normalization layers.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for attention probabilities.
- rope_theta (float, optional, defaults to 10000.0) — Base frequency for RoPE.
- window_size (int, optional, defaults to 24) — Window size for windowed attention.
- global_attn_indexes (list[int], optional, defaults to [7, 15, 23, 31]) — Indexes of layers with global attention.
- layer_scale_init_value (float, optional) — Initial value for layer scale. None means no layer scale.
- pretrain_image_size (int, optional, defaults to 336) — Pretrained model image size for position embedding initialization.
- hidden_dropout (float, optional, defaults to 0.0) — Dropout probability for hidden states.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing weight matrices.
Configuration class for SAM3 Vision Encoder (ViT backbone).
Instantiating a configuration with the defaults will yield a configuration similar to that of the SAM 3 facebook/sam3 architecture.
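A minimal sketch of instantiating the ViT backbone configuration and a randomly initialized Sam3ViTModel from it:
>>> from transformers import Sam3ViTConfig, Sam3ViTModel
>>> # Default backbone configuration
>>> configuration = Sam3ViTConfig()
>>> # Randomly initialized backbone built from that configuration
>>> model = Sam3ViTModel(configuration)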
Sam3VisionConfig
class transformers.Sam3VisionConfig
< source >( backbone_config = None fpn_hidden_size = 256 backbone_feature_sizes = None scale_factors = None hidden_act = 'gelu' layer_norm_eps = 1e-06 initializer_range = 0.02 **kwargs )
Parameters
- backbone_config (Union[dict, "PreTrainedConfig"], optional) — Configuration for the vision backbone. This is used to instantiate the backbone using AutoModel.from_config.
- fpn_hidden_size (int, optional, defaults to 256) — The hidden dimension of the FPN.
- backbone_feature_sizes (List[List[int]], optional, defaults to [[288, 288], [144, 144], [72, 72]]) — The spatial sizes (height, width) of the feature maps from the backbone at different scales.
- scale_factors (list[float], optional, defaults to [4.0, 2.0, 1.0, 0.5]) — Scale factors for FPN multi-scale features. List of scaling factors for each FPN level.
- hidden_act (str, optional, defaults to "gelu") — The non-linear activation function in the neck.
- layer_norm_eps (float, optional, defaults to 1e-06) — The epsilon for the layer normalization.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
This is the configuration class to store the configuration of a Sam3VisionModel. It is used to instantiate a SAM 3 vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the SAM 3 facebook/sam3 architecture.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
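A minimal sketch of instantiating the vision encoder configuration and a randomly initialized Sam3VisionModel from it:
>>> from transformers import Sam3VisionConfig, Sam3VisionModel
>>> # Default vision encoder configuration (backbone plus FPN neck)
>>> configuration = Sam3VisionConfig()
>>> # Randomly initialized vision encoder built from that configuration
>>> model = Sam3VisionModel(configuration)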
Sam3GeometryEncoderConfig
class transformers.Sam3GeometryEncoderConfig
< source >( hidden_size = 256 num_layers = 3 num_attention_heads = 8 intermediate_size = 2048 dropout = 0.1 hidden_act = 'relu' hidden_dropout = 0.0 layer_norm_eps = 1e-06 roi_size = 7 initializer_range = 0.02 **kwargs )
Parameters
- hidden_size (int, optional, defaults to 256) — Dimensionality of the encoder layers.
- num_layers (int, optional, defaults to 3) — Number of transformer encoder layers for processing geometry prompts.
- num_attention_heads (int, optional, defaults to 8) — Number of attention heads in the geometry encoder.
- intermediate_size (int, optional, defaults to 2048) — Dimensionality of the feedforward layers.
- dropout (float, optional, defaults to 0.1) — Dropout probability.
- hidden_act (str, optional, defaults to "relu") — Activation function in FFN.
- hidden_dropout (float, optional, defaults to 0.0) — Dropout probability for hidden states.
- layer_norm_eps (float, optional, defaults to 1e-06) — Epsilon for layer normalization.
- roi_size (int, optional, defaults to 7) — ROI size for box pooling operations.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing weight matrices.
Configuration class for SAM3 Geometry Encoder.
Sam3DETREncoderConfig
class transformers.Sam3DETREncoderConfig
< source >( hidden_size = 256 num_layers = 6 num_attention_heads = 8 intermediate_size = 2048 dropout = 0.1 hidden_act = 'relu' hidden_dropout = 0.0 layer_norm_eps = 1e-06 initializer_range = 0.02 **kwargs )
Parameters
- hidden_size (int, optional, defaults to 256) — Dimensionality of the encoder layers.
- num_layers (int, optional, defaults to 6) — Number of encoder layers.
- num_attention_heads (int, optional, defaults to 8) — Number of attention heads.
- intermediate_size (int, optional, defaults to 2048) — Dimensionality of the feedforward layers.
- dropout (float, optional, defaults to 0.1) — Dropout probability.
- hidden_act (str, optional, defaults to "relu") — Activation function in FFN.
- hidden_dropout (float, optional, defaults to 0.0) — Dropout probability for hidden states.
- layer_norm_eps (float, optional, defaults to 1e-06) — Epsilon for layer normalization.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing weight matrices.
Configuration class for SAM3 DETR Encoder (vision-text fusion encoder).
Sam3DETRDecoderConfig
class transformers.Sam3DETRDecoderConfig
< source >( hidden_size = 256 num_layers = 6 num_queries = 200 num_attention_heads = 8 intermediate_size = 2048 dropout = 0.1 hidden_act = 'relu' hidden_dropout = 0.0 layer_norm_eps = 1e-06 initializer_range = 0.02 **kwargs )
Parameters
- hidden_size (int, optional, defaults to 256) — Dimensionality of the decoder layers.
- num_layers (int, optional, defaults to 6) — Number of decoder layers.
- num_queries (int, optional, defaults to 200) — Number of object queries.
- num_attention_heads (int, optional, defaults to 8) — Number of attention heads.
- intermediate_size (int, optional, defaults to 2048) — Dimensionality of the feedforward layers.
- dropout (float, optional, defaults to 0.1) — Dropout probability.
- hidden_act (str, optional, defaults to "relu") — Activation function in FFN.
- hidden_dropout (float, optional, defaults to 0.0) — Dropout probability for hidden states.
- layer_norm_eps (float, optional, defaults to 1e-06) — Epsilon for layer normalization.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing weight matrices.
Configuration class for SAM3 DETR Decoder (object query decoder).
Sam3MaskDecoderConfig
class transformers.Sam3MaskDecoderConfig
< source >( hidden_size = 256 num_upsampling_stages = 3 layer_norm_eps = 1e-06 dropout = 0.0 num_attention_heads = 8 initializer_range = 0.02 **kwargs )
Parameters
- hidden_size (int, optional, defaults to 256) — Dimensionality of the mask decoder.
- num_upsampling_stages (int, optional, defaults to 3) — Number of upsampling stages in the pixel decoder (FPN).
- layer_norm_eps (float, optional, defaults to 1e-06) — Epsilon for layer normalization.
- dropout (float, optional, defaults to 0.0) — Dropout probability for prompt cross-attention.
- num_attention_heads (int, optional, defaults to 8) — Number of attention heads for prompt cross-attention.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing weight matrices.
Configuration class for SAM3 Mask Decoder (pixel-level mask prediction).
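The sub-configurations above can also be customized and passed to the main configuration. A minimal sketch (the chosen values are illustrative only):
>>> from transformers import Sam3Config, Sam3DETRDecoderConfig, Sam3MaskDecoderConfig
>>> # Customize a couple of sub-configurations
>>> detr_decoder_config = Sam3DETRDecoderConfig(num_queries=100)
>>> mask_decoder_config = Sam3MaskDecoderConfig(num_upsampling_stages=3)
>>> # Build the full SAM3 configuration from them
>>> config = Sam3Config(
...     detr_decoder_config=detr_decoder_config,
...     mask_decoder_config=mask_decoder_config,
... )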
Sam3Processor
class transformers.Sam3Processor
< source >( image_processor tokenizer target_size: typing.Optional[int] = None point_pad_value: int = -10 **kwargs )
Parameters
- image_processor (Sam3ImageProcessorFast) — An instance of Sam3ImageProcessorFast. The image processor is a required input.
- tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast) — An instance of PreTrainedTokenizer or PreTrainedTokenizerFast. The tokenizer is a required input.
- target_size (int, optional) — The target size (target_size, target_size) to which the image will be resized.
- point_pad_value (int, optional, defaults to -10) — The value used for padding input boxes.
Constructs a SAM3 processor which wraps a SAM3 image processor and bounding box processing into a single processor.
Sam3Processor offers all the functionalities of Sam3ImageProcessorFast. See the docstring of __call__() for more information.
__call__
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None text: typing.Union[str, list[str], list[list[str]], NoneType] = None segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None input_boxes: typing.Union[list[list[list[float]]], torch.Tensor, NoneType] = None input_boxes_labels: typing.Union[list[list[list[int]]], torch.Tensor, NoneType] = None original_sizes: typing.Union[list[list[float]], torch.Tensor, NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None **kwargs ) → A BatchEncoding with the following fields
Parameters
- images (ImageInput, optional) — The image(s) to process.
- text (str, list[str], list[list[str]], optional) — The text to process.
- segmentation_maps (ImageInput, optional) — The segmentation maps to process.
- input_boxes (list[list[list[float]]], torch.Tensor, optional) — The bounding boxes to process.
- input_boxes_labels (list[list[int]], torch.Tensor, optional) — The labels for the bounding boxes.
- original_sizes (list[list[float]], torch.Tensor, optional) — The original sizes of the images.
- return_tensors (str or TensorType, optional) — The type of tensors to return.
- **kwargs — Additional keyword arguments to pass to the image processor.
Returns
A BatchEncoding with the following fields
- pixel_values (torch.Tensor): The processed image(s).
- original_sizes (list[list[float]]): The original sizes of the images.
- labels (torch.Tensor): The processed segmentation maps (if provided).
- input_boxes_labels (torch.Tensor): The processed labels for the bounding boxes.
- input_boxes (torch.Tensor): The processed bounding boxes.
This method uses the Sam3ImageProcessorFast.__call__() method to prepare image(s) for the model. It also prepares bounding boxes for the model if they are provided.
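A minimal sketch of preparing combined text and box prompts (it reuses a PIL image from the usage examples above; the exact set of returned keys depends on which prompts are passed):
>>> from transformers import Sam3Processor
>>> processor = Sam3Processor.from_pretrained("facebook/sam3")
>>> inputs = processor(
...     images=image,                          # a PIL.Image
...     text="ear",
...     input_boxes=[[[100, 150, 500, 450]]],  # one box for the single image
...     input_boxes_labels=[[0]],              # 0 = negative box
...     return_tensors="pt",
... )
>>> print(sorted(inputs.keys()))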
Sam3ImageProcessorFast
class transformers.Sam3ImageProcessorFast
< source >( **kwargs: typing_extensions.Unpack[transformers.models.sam3.image_processing_sam3_fast.Sam3FastImageProcessorKwargs] )
Constructs a fast Sam3 image processor.
preprocess
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None **kwargs: typing_extensions.Unpack[transformers.models.sam3.image_processing_sam3_fast.Sam3FastImageProcessorKwargs] ) → <class 'transformers.image_processing_base.BatchFeature'>
Parameters
- images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
- segmentation_maps (ImageInput, optional) — The segmentation maps to preprocess.
- do_convert_rgb (bool, optional) — Whether to convert the image to RGB.
- do_resize (bool, optional) — Whether to resize the image.
- size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — Describes the maximum input dimensions to the model.
- crop_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — Size of the output image after applying center_crop.
- resample (Annotated[Union[PILImageResampling, int, NoneType], None]) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
- do_rescale (bool, optional) — Whether to rescale the image.
- rescale_factor (float, optional) — Rescale factor to rescale the image by if do_rescale is set to True.
- do_normalize (bool, optional) — Whether to normalize the image.
- image_mean (Union[float, list[float], tuple[float, ...], NoneType]) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
- image_std (Union[float, list[float], tuple[float, ...], NoneType]) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
- do_pad (bool, optional) — Whether to pad the image. Padding is done either to the largest size in the batch or to a fixed square size per image. The exact padding strategy depends on the model.
- pad_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — The size {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch. Applied only when do_pad=True.
- do_center_crop (bool, optional) — Whether to center crop the image.
- data_format (Union[~image_utils.ChannelDimension, str, NoneType]) — Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
- input_data_format (Union[~image_utils.ChannelDimension, str, NoneType]) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  - "none" or ChannelDimension.NONE: image in (height, width) format.
- device (Annotated[Union[str, torch.device, NoneType], None]) — The device to process the images on. If unset, the device is inferred from the input images.
- return_tensors (Annotated[Union[str, ~utils.generic.TensorType, NoneType], None]) — Returns stacked tensors if set to "pt", otherwise returns a list of tensors.
- disable_grouping (bool, optional) — Whether to disable grouping of images by size to process them individually and not in batches. If None, will be set to True if the images are on CPU, and False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157
- image_seq_length (int, optional) — The number of image tokens to be used for each image in the input. Added for backward compatibility but this should be set as a processor attribute in future models.
- mask_size (dict[str, int], optional) — The size {"height": int, "width": int} to resize the segmentation maps to.
Returns
<class 'transformers.image_processing_base.BatchFeature'>
- data (dict) — Dictionary of lists/arrays/tensors returned by the call method ("pixel_values", etc.).
- tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers in PyTorch/Numpy Tensors at initialization.
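A minimal sketch of running the fast image processor on its own (it reuses a PIL image from the usage examples above):
>>> from transformers import Sam3ImageProcessorFast
>>> image_processor = Sam3ImageProcessorFast.from_pretrained("facebook/sam3")
>>> features = image_processor(images=image, return_tensors="pt")
>>> print(features.pixel_values.shape)  # batched, resized pixel values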
Sam3ViTModel
class transformers.Sam3ViTModel
< source >( config: Sam3ViTConfig )
Parameters
- config (Sam3ViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Sam3 Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( pixel_values: Tensor **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using Sam3ImageProcessorFast. See Sam3ImageProcessorFast.__call__ for details (Sam3Processor uses Sam3ImageProcessorFast for processing images).
Returns
transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (Sam3Config) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The Sam3ViTModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Sam3VisionModel
class transformers.Sam3VisionModel
< source >( config: Sam3VisionConfig )
Parameters
- config (Sam3VisionConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The vision model from SAM3 without any head or projection on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( pixel_values: typing.Optional[torch.FloatTensor] = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] )
Sam3Model
forward
< source >( pixel_values: typing.Optional[torch.FloatTensor] = None vision_embeds: typing.Optional[transformers.models.sam3.modeling_sam3.Sam3VisionEncoderOutput] = None input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None text_embeds: typing.Optional[torch.FloatTensor] = None input_boxes: typing.Optional[torch.FloatTensor] = None input_boxes_labels: typing.Optional[torch.LongTensor] = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.models.sam3.modeling_sam3.Sam3ImageSegmentationOutput or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using Sam3ImageProcessorFast. See Sam3ImageProcessorFast.__call__ for details (Sam3Processor uses Sam3ImageProcessorFast for processing images).
- vision_embeds (Sam3VisionEncoderOutput, optional) — Pre-computed vision embeddings. Can be used to easily reuse vision embeddings. If provided, pixel_values should not be passed. Mutually exclusive with pixel_values.
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- text_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Pre-computed text embeddings. Can be used to easily reuse text embeddings. If provided, input_ids should not be passed. Mutually exclusive with input_ids.
- input_boxes (torch.FloatTensor of shape (batch_size, num_boxes, 4), optional) — Normalized box coordinates in [0, 1] range, in (cx, cy, w, h) format.
- input_boxes_labels (torch.LongTensor of shape (batch_size, num_boxes), optional) — Labels for boxes: 1 (positive), 0 (negative).
Returns
transformers.models.sam3.modeling_sam3.Sam3ImageSegmentationOutput or tuple(torch.FloatTensor)
A transformers.models.sam3.modeling_sam3.Sam3ImageSegmentationOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (Sam3Config) and inputs.
- pred_masks (torch.FloatTensor of shape (batch_size, num_queries, height, width)) — Predicted segmentation masks for each query.
- pred_boxes (torch.FloatTensor of shape (batch_size, num_queries, 4)) — Predicted bounding boxes in (x1, y1, x2, y2) format.
- pred_logits (torch.FloatTensor of shape (batch_size, num_queries), optional) — Classification confidence scores for each query, computed via dot product between decoder query features and text features.
- presence_logits (torch.FloatTensor of shape (batch_size, 1), optional) — Presence logits from the DETR decoder presence token (last layer only). These indicate whether objects are present in the scene. Can be used to compute final scores by multiplying with pred_logits: final_scores = pred_logits.sigmoid() * presence_logits.sigmoid().
- semantic_seg (torch.FloatTensor of shape (batch_size, 1, height, width), optional) — Semantic segmentation output.
- decoder_hidden_states (tuple[torch.FloatTensor], optional) — Tuple of hidden states from all DETR decoder layers. Each tensor has shape (batch_size, num_queries, hidden_size).
- decoder_reference_boxes (torch.FloatTensor of shape (num_layers, batch_size, num_queries, 4), optional) — Reference boxes from all DETR decoder layers.
- encoder_hidden_states (tuple[torch.FloatTensor], optional) — Tuple of hidden states from all DETR encoder layers.
- vision_hidden_states (tuple[torch.FloatTensor], optional) — Tuple of hidden states from all vision encoder (ViT) layers.
- vision_attentions (tuple[torch.FloatTensor], optional) — Attention weights from vision encoder (ViT) layers.
- detr_encoder_attentions (tuple[torch.FloatTensor], optional) — Attention weights from DETR encoder layers.
- detr_decoder_attentions (tuple[torch.FloatTensor], optional) — Attention weights from DETR decoder layers (self-attention and cross-attention).
- mask_decoder_attentions (tuple[torch.FloatTensor], optional) — Attention weights from mask decoder layers.
The Sam3Model forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoModel, AutoProcessor
>>> model = AutoModel.from_pretrained("facebook/sam3")
>>> processor = AutoProcessor.from_pretrained("facebook/sam3")
>>> img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/sam-car.png"
>>> raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
>>> text = "car"
>>> inputs = processor(images=raw_image, text=text, return_tensors="pt")
>>> # Get segmentation output
>>> outputs = model(**inputs)
>>> pred_masks = outputs.pred_masks
>>> pred_boxes = outputs.pred_boxes
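As noted in the Sam3ImageSegmentationOutput documentation above, the per-query logits can be combined with the presence logit to obtain final confidence scores. Continuing the example:
>>> # Combine per-query confidence with the image-level presence score
>>> final_scores = outputs.pred_logits.sigmoid() * outputs.presence_logits.sigmoid()
>>> keep = final_scores[0] > 0.5  # 0.5 is an arbitrary threshold for this sketch
>>> print(f"Confident detections: {keep.sum().item()}")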