HF Kernels - Deformable DETR

GPU Info

Cell: nv | 0.23s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Fri Oct 31 20:13:34 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   43C    P0             83W /  350W |       0MiB /  46068MiB |     60%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Deformable DETR Multi-Scale Deformable Attention Benchmark

Cell: benchmark | 8.30s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels import get_kernel
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark

# Load the Deformable DETR kernel from the Hugging Face Hub
deformable_detr = get_kernel("kernels-community/deformable-detr")


def hf_kernels_deformable_detr(
    value, spatial_shapes, level_start_index, sampling_locations, attention_weights, im2col_step=64
):
    """Forward pass of multi-scale deformable attention via the Hugging Face `kernels` Deformable DETR kernel."""
    return deformable_detr.ms_deform_attn_forward(
        value=value,
        spatial_shapes=spatial_shapes,
        level_start_index=level_start_index,
        sampling_loc=sampling_locations,
        attn_weight=attention_weights,
        im2col_step=im2col_step
    )


run_benchmark(
    kernel_type=KernelTypeEnum.DEFORMABLE_DETR,
    impl_name="hf_kernels_deformable_detr",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_deformable_detr,
    dtype="float32",
)
Running deformable_detr benchmark on cuda with 4 workloads.

======================================================================
PROFILE TRACE: hf_kernels_deformable_detr | cuda_B1_Q100_H8_E256_L4_P4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             hf_kernels_deformable_detr         0.00%       0.000us         0.00%       0.000us       0.000us     195.201us       770.15%     195.201us     195.201us             1  
                             hf_kernels_deformable_detr         7.43%     141.524us        99.61%       1.898ms       1.898ms       0.000us         0.00%      26.403us      26.403us             1  
       _deformable_detr_57c3d32::ms_deform_attn_forward         3.93%      74.960us        92.19%       1.756ms     585.455us      22.464us        88.63%      26.403us       8.801us             3  
void ms_deformable_im2col_gpu_kernel<float>(int, flo...         0.00%       0.000us         0.00%       0.000us       0.000us      22.464us        88.63%      22.464us       7.488us             3  
                                            aten::zeros         1.20%      22.800us        85.08%       1.621ms     540.337us       0.000us         0.00%       3.939us       1.313us             3  
                                            aten::zero_         0.89%      16.910us        82.13%       1.565ms     521.590us       0.000us         0.00%       3.939us       1.313us             3  
                                            aten::fill_         1.72%      32.820us        81.24%       1.548ms     515.953us       2.882us        11.37%       3.939us       1.313us             3  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.882us        11.37%       2.882us       0.961us             3  
                                Activity Buffer Request        77.24%       1.472ms        77.24%       1.472ms       1.472ms       1.057us         4.17%       1.057us       1.057us             1  
                                            aten::empty         1.76%      33.441us         1.76%      33.441us      11.147us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         3.19%      60.842us         3.19%      60.842us      10.140us       0.000us         0.00%       0.000us       0.000us             6  
                                             aten::view         0.89%      16.922us         0.89%      16.922us       2.820us       0.000us         0.00%       0.000us       0.000us             6  
                                           aten::select         1.13%      21.591us         1.37%      26.081us       8.694us       0.000us         0.00%       0.000us       0.000us             3  
                                       aten::as_strided         0.24%       4.490us         0.24%       4.490us       1.497us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.39%       7.340us         0.39%       7.340us       7.340us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.905ms
Self CUDA time total: 25.346us
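
The first `hf_kernels_deformable_detr` row in the trace above reports a Self CUDA % of 770.15%, which can look like an error. The short calculation below uses only numbers copied from that trace and suggests one way to read it: the reported `Self CUDA time total` equals the sum of the two GPU kernel rows, while the first row appears to be the profiler's range annotation spanning the whole profiled region, so its share is computed against a total it is not part of. That reading is an interpretation of the numbers, not something the trace itself states.

```python
# Values copied from the cuda_B1_Q100_H8_E256_L4_P4 trace, in microseconds.
im2col_kernel_us = 22.464   # ms_deformable_im2col_gpu_kernel<float>, 3 calls
fill_kernel_us = 2.882      # vectorized_elementwise_kernel (zero fill), 3 calls
annotation_us = 195.201     # first hf_kernels_deformable_detr row
reported_total_us = 25.346  # "Self CUDA time total"

# The reported total matches the two kernel rows added together.
kernel_sum_us = im2col_kernel_us + fill_kernel_us

# Percentages are taken relative to the reported total.
annotation_pct = 100 * annotation_us / reported_total_us
im2col_pct = 100 * im2col_kernel_us / reported_total_us

print(f"sum of kernel rows:   {kernel_sum_us:.3f} us")
print(f"annotation row share: {annotation_pct:.2f}%")
print(f"im2col kernel share:  {im2col_pct:.2f}%")
```

The same check holds for the other three traces (for example 23.550 + 2.848 = 26.398 us in the Q300 trace), so the kernel rows, not the annotation row, are the ones to compare across workloads.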



======================================================================
PROFILE TRACE: hf_kernels_deformable_detr | cuda_B1_Q300_H8_E256_L4_P4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             hf_kernels_deformable_detr         0.00%       0.000us         0.00%       0.000us       0.000us     144.191us       546.22%     144.191us     144.191us             1  
                             hf_kernels_deformable_detr         4.39%      75.912us        99.67%       1.722ms       1.722ms       0.000us         0.00%      27.358us      27.358us             1  
       _deformable_detr_57c3d32::ms_deform_attn_forward         2.01%      34.700us        95.28%       1.646ms     548.647us      23.550us        89.21%      27.358us       9.119us             3  
void ms_deformable_im2col_gpu_kernel<float>(int, flo...         0.00%       0.000us         0.00%       0.000us       0.000us      23.550us        89.21%      23.550us       7.850us             3  
                                            aten::zeros         0.49%       8.451us        91.07%       1.573ms     524.424us       0.000us         0.00%       3.808us       1.269us             3  
                                            aten::zero_         0.50%       8.669us        89.54%       1.547ms     515.616us       0.000us         0.00%       3.808us       1.269us             3  
                                            aten::fill_         1.60%      27.701us        89.04%       1.538ms     512.727us       2.848us        10.79%       3.808us       1.269us             3  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.848us        10.79%       2.848us       0.949us             3  
                                Activity Buffer Request        85.90%       1.484ms        85.90%       1.484ms       1.484ms       0.960us         3.64%       0.960us       0.960us             1  
                                            aten::empty         1.04%      17.971us         1.04%      17.971us       5.990us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.40%      41.442us         2.40%      41.442us       6.907us       0.000us         0.00%       0.000us       0.000us             6  
                                             aten::view         0.54%       9.400us         0.54%       9.400us       1.567us       0.000us         0.00%       0.000us       0.000us             6  
                                           aten::select         0.66%      11.329us         0.79%      13.720us       4.573us       0.000us         0.00%       0.000us       0.000us             3  
                                       aten::as_strided         0.14%       2.391us         0.14%       2.391us       0.797us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.33%       5.680us         0.33%       5.680us       5.680us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.728ms
Self CUDA time total: 26.398us



======================================================================
PROFILE TRACE: hf_kernels_deformable_detr | cuda_B2_Q100_H8_E256_L4_P4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             hf_kernels_deformable_detr         0.00%       0.000us         0.00%       0.000us       0.000us     140.288us       549.37%     140.288us     140.288us             1  
                             hf_kernels_deformable_detr         4.34%      74.492us        99.67%       1.709ms       1.709ms       0.000us         0.00%      26.464us      26.464us             1  
       _deformable_detr_57c3d32::ms_deform_attn_forward         1.96%      33.680us        95.32%       1.635ms     544.984us      22.752us        89.10%      26.464us       8.821us             3  
void ms_deformable_im2col_gpu_kernel<float>(int, flo...         0.00%       0.000us         0.00%       0.000us       0.000us      22.752us        89.10%      22.752us       7.584us             3  
                                            aten::zeros         0.50%       8.650us        91.19%       1.564ms     521.367us       0.000us         0.00%       3.712us       1.237us             3  
                                            aten::zero_         0.47%       8.130us        89.69%       1.538ms     512.773us       0.000us         0.00%       3.712us       1.237us             3  
                                            aten::fill_         1.63%      27.881us        89.21%       1.530ms     510.063us       2.784us        10.90%       3.712us       1.237us             3  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.784us        10.90%       2.784us       0.928us             3  
                                Activity Buffer Request        86.04%       1.476ms        86.04%       1.476ms       1.476ms       0.928us         3.63%       0.928us       0.928us             1  
                                            aten::empty         1.00%      17.131us         1.00%      17.131us       5.710us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.42%      41.510us         2.42%      41.510us       6.918us       0.000us         0.00%       0.000us       0.000us             6  
                                             aten::view         0.52%       8.991us         0.52%       8.991us       1.498us       0.000us         0.00%       0.000us       0.000us             6  
                                           aten::select         0.62%      10.681us         0.77%      13.291us       4.430us       0.000us         0.00%       0.000us       0.000us             3  
                                       aten::as_strided         0.15%       2.610us         0.15%       2.610us       0.870us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.33%       5.730us         0.33%       5.730us       5.730us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.715ms
Self CUDA time total: 25.536us



======================================================================
PROFILE TRACE: hf_kernels_deformable_detr | cuda_B2_Q300_H8_E256_L4_P4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             hf_kernels_deformable_detr         0.00%       0.000us         0.00%       0.000us       0.000us     151.934us       322.76%     151.934us     151.934us             1  
                             hf_kernels_deformable_detr         3.86%      74.313us        99.75%       1.919ms       1.919ms       0.000us         0.00%      48.129us      48.129us             1  
       _deformable_detr_57c3d32::ms_deform_attn_forward         1.79%      34.420us        95.88%       1.844ms     614.769us      43.968us        93.40%      48.129us      16.043us             3  
void ms_deformable_im2col_gpu_kernel<float>(int, flo...         0.00%       0.000us         0.00%       0.000us       0.000us      43.968us        93.40%      43.968us      14.656us             3  
                                            aten::zeros         0.45%       8.600us        92.03%       1.770ms     590.092us       0.000us         0.00%       4.161us       1.387us             3  
                                            aten::zero_         0.45%       8.690us        90.72%       1.745ms     581.642us       0.000us         0.00%       4.161us       1.387us             3  
                                            aten::fill_         1.44%      27.641us        90.26%       1.736ms     578.745us       3.105us         6.60%       4.161us       1.387us             3  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       3.105us         6.60%       3.105us       1.035us             3  
                                Activity Buffer Request        76.84%       1.478ms        76.84%       1.478ms       1.478ms       1.056us         2.24%       1.056us       1.056us             1  
                                            aten::empty         0.87%      16.750us         0.87%      16.750us       5.583us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        12.74%     245.037us        12.74%     245.037us      40.839us       0.000us         0.00%       0.000us       0.000us             6  
                                             aten::view         0.49%       9.420us         0.49%       9.420us       1.570us       0.000us         0.00%       0.000us       0.000us             6  
                                           aten::select         0.66%      12.781us         0.82%      15.781us       5.260us       0.000us         0.00%       0.000us       0.000us             3  
                                       aten::as_strided         0.16%       3.000us         0.16%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.25%       4.890us         0.25%       4.890us       4.890us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.924ms
Self CUDA time total: 47.073us
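
To put the four traces side by side, the sketch below tabulates the average `ms_deformable_im2col_gpu_kernel` time per call against the number of queries per call. The kernel times are copied from the traces above; reading `B` as batch size and `Q` as queries per batch element is an assumption based on the workload names.

```python
# (workload, batch, queries per element, avg im2col kernel time per call in us)
# Kernel times are copied from the "CUDA time avg" column of the traces.
rows = [
    ("cuda_B1_Q100_H8_E256_L4_P4", 1, 100, 7.488),
    ("cuda_B1_Q300_H8_E256_L4_P4", 1, 300, 7.850),
    ("cuda_B2_Q100_H8_E256_L4_P4", 2, 100, 7.584),
    ("cuda_B2_Q300_H8_E256_L4_P4", 2, 300, 14.656),
]

baseline_us = rows[0][3]
ratios = []
for name, batch, queries, kernel_us in rows:
    total_queries = batch * queries
    ratio = kernel_us / baseline_us
    ratios.append(ratio)
    print(f"{name}: {total_queries:4d} queries, {kernel_us:6.3f} us, {ratio:.2f}x baseline")
```

On these numbers the kernel time is nearly flat from 100 to 300 total queries and roughly doubles at 600. The end-to-end p50 values in the summary (0.04 to 0.05 ms) are several times larger than the kernel time, which suggests that at these sizes latency is likely dominated by launch and dispatch overhead rather than by the kernel itself.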


impl                        wl                          p50(ms)  ok
hf_kernels_deformable_detr  cuda_B1_Q100_H8_E256_L4_P4     0.04  True
hf_kernels_deformable_detr  cuda_B1_Q300_H8_E256_L4_P4     0.05  True
hf_kernels_deformable_detr  cuda_B2_Q100_H8_E256_L4_P4     0.05  True
hf_kernels_deformable_detr  cuda_B2_Q300_H8_E256_L4_P4     0.05  True
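
For readers who want to see what the benchmarked operation computes, here is a minimal pure-PyTorch sketch of multi-scale deformable attention built on `torch.nn.functional.grid_sample`. It is a reference formulation, not the code of the benchmarked kernel and not how `kernels_benchmark_tools` validates results. The function name `ms_deform_attn_reference`, the spatial shapes, and the reading of the workload name (`H8` as 8 heads, `E256` as embedding size 256, `L4` as 4 levels, `P4` as 4 sampling points) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def ms_deform_attn_reference(value, spatial_shapes, sampling_locations, attention_weights):
    """Reference multi-scale deformable attention on plain PyTorch ops.

    value:              (B, sum(H_l * W_l), heads, head_dim)
    spatial_shapes:     list of (H_l, W_l), one pair per feature level
    sampling_locations: (B, Q, heads, L, P, 2), normalized to [0, 1], last dim (x, y)
    attention_weights:  (B, Q, heads, L, P), summing to 1 over (L, P)
    returns:            (B, Q, heads * head_dim)
    """
    B, _, n_heads, head_dim = value.shape
    _, Q, _, L, P, _ = sampling_locations.shape
    value_per_level = value.split([h * w for h, w in spatial_shapes], dim=1)
    grids = 2 * sampling_locations - 1  # grid_sample expects coordinates in [-1, 1]
    sampled = []
    for lvl, (h, w) in enumerate(spatial_shapes):
        # (B, H*W, heads, D) -> (B*heads, D, H, W)
        v = value_per_level[lvl].flatten(2).transpose(1, 2).reshape(B * n_heads, head_dim, h, w)
        # (B, Q, heads, P, 2) -> (B*heads, Q, P, 2)
        g = grids[:, :, :, lvl].transpose(1, 2).flatten(0, 1)
        # Bilinear sampling at each point -> (B*heads, D, Q, P)
        sampled.append(F.grid_sample(v, g, mode="bilinear", padding_mode="zeros", align_corners=False))
    # (B, Q, heads, L, P) -> (B*heads, 1, Q, L*P)
    w = attention_weights.transpose(1, 2).reshape(B * n_heads, 1, Q, L * P)
    # Weighted sum over all levels and points -> (B*heads, D, Q)
    out = (torch.stack(sampled, dim=-2).flatten(-2) * w).sum(-1)
    return out.view(B, n_heads * head_dim, Q).transpose(1, 2).contiguous()


# Hypothetical inputs shaped like cuda_B1_Q100_H8_E256_L4_P4.
# The spatial shapes are assumed; the workload name does not give them.
torch.manual_seed(0)
B, Q, n_heads, embed_dim, L, P = 1, 100, 8, 256, 4, 4
spatial_shapes = [(32, 32), (16, 16), (8, 8), (4, 4)]
num_keys = sum(h * w for h, w in spatial_shapes)
value = torch.randn(B, num_keys, n_heads, embed_dim // n_heads)
sampling_locations = torch.rand(B, Q, n_heads, L, P, 2)
attention_weights = torch.softmax(torch.randn(B, Q, n_heads, L * P), dim=-1).view(B, Q, n_heads, L, P)

out = ms_deform_attn_reference(value, spatial_shapes, sampling_locations, attention_weights)
print(out.shape)  # torch.Size([1, 100, 256])
```

This sketch runs on CPU. Comparing it against `hf_kernels_deformable_detr` would additionally require CUDA tensors plus `spatial_shapes` and `level_start_index` in whatever tensor form the kernel expects, which this page does not show.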