HF Kernels - Deformable DETR

GPU Info

Cell: nv | 0.22s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Mon Nov 10 21:58:17 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   28C    P0             79W /  350W |       0MiB /  46068MiB |     11%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
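
The table above is easy to read, but a machine-readable query may be easier to log next to benchmark results. The following is a minimal sketch, assuming the host's `nvidia-smi` supports the `--query-gpu` and `--format=csv,noheader` options. `parse_gpu_csv` and `query_gpus` are hypothetical helper names, and the sample line reuses values from the table above rather than a fresh query.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader` output into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        gpus.append(dict(zip(QUERY_FIELDS, values)))
    return gpus


def query_gpus():
    """Run nvidia-smi; return [] if the tool is missing or exits with an error."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(QUERY_FIELDS)}",
        "--format=csv,noheader",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_csv(result.stdout)


# Sample row built from the values shown in the table above (not a live query).
sample = "NVIDIA L40S, 580.95.05, 46068 MiB\n"
gpus = parse_gpu_csv(sample)
print(gpus)
```

Calling `query_gpus()` on the benchmark host would return the same structure for whatever devices are actually present.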

Deformable DETR Multi-Scale Deformable Attention Benchmark

Cell: benchmark | 8.74s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the deformable DETR kernel
deformable_detr = get_kernel("kernels-community/deformable-detr")


def hf_kernels_deformable_detr(
    value,
    spatial_shapes,
    level_start_index,
    sampling_locations,
    attention_weights,
    im2col_step=64,
):
    """Multi-scale deformable attention forward pass using the HF Kernels implementation."""
    # The kernel uses shorter argument names: sampling_loc and attn_weight.
    return deformable_detr.ms_deform_attn_forward(
        value=value,
        spatial_shapes=spatial_shapes,
        level_start_index=level_start_index,
        sampling_loc=sampling_locations,
        attn_weight=attention_weights,
        im2col_step=im2col_step,
    )


run_benchmark(
    kernel_type=KernelTypeEnum.DEFORMABLE_DETR,
    impl_name="hf_kernels_deformable_detr",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_deformable_detr,
    dtype="float32",
)
Running deformable_detr benchmark on cuda with 4 workloads.

======================================================================
PROFILE TRACE: hf_kernels_deformable_detr | cuda_B1_Q100_H8_E256_L4_P4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             hf_kernels_deformable_detr         0.00%       0.000us         0.00%       0.000us       0.000us     189.823us       748.99%     189.823us     189.823us             1  
                             hf_kernels_deformable_detr         6.28%     137.822us        99.65%       2.188ms       2.188ms       0.000us         0.00%      26.400us      26.400us             1  
       _deformable_detr_57c3d32::ms_deform_attn_forward         3.04%      66.841us        93.38%       2.051ms     683.551us      22.496us        88.76%      26.400us       8.800us             3  
void ms_deformable_im2col_gpu_kernel<float>(int, flo...         0.00%       0.000us         0.00%       0.000us       0.000us      22.496us        88.76%      22.496us       7.499us             3  
                                            aten::zeros         0.83%      18.191us        87.50%       1.922ms     640.537us       0.000us         0.00%       3.904us       1.301us             3  
                                            aten::zero_         0.64%      14.160us        85.08%       1.868ms     622.823us       0.000us         0.00%       3.904us       1.301us             3  
                                            aten::fill_         1.45%      31.860us        84.44%       1.854ms     618.103us       2.848us        11.24%       3.904us       1.301us             3  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.848us        11.24%       2.848us       0.949us             3  
                                Activity Buffer Request        80.97%       1.778ms        80.97%       1.778ms       1.778ms       1.056us         4.17%       1.056us       1.056us             1  
                                            aten::empty         1.59%      34.950us         1.59%      34.950us      11.650us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.83%      62.083us         2.83%      62.083us      10.347us       0.000us         0.00%       0.000us       0.000us             6  
                                             aten::view         0.81%      17.870us         0.81%      17.870us       2.978us       0.000us         0.00%       0.000us       0.000us             6  
                                           aten::select         1.01%      22.200us         1.21%      26.600us       8.867us       0.000us         0.00%       0.000us       0.000us             3  
                                       aten::as_strided         0.20%       4.400us         0.20%       4.400us       1.467us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.35%       7.640us         0.35%       7.640us       7.640us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.196ms
Self CUDA time total: 25.344us



======================================================================
PROFILE TRACE: hf_kernels_deformable_detr | cuda_B1_Q300_H8_E256_L4_P4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             hf_kernels_deformable_detr         0.00%       0.000us         0.00%       0.000us       0.000us     134.592us       507.36%     134.592us     134.592us             1  
                             hf_kernels_deformable_detr         3.69%      73.590us        99.72%       1.986ms       1.986ms       0.000us         0.00%      27.456us      27.456us             1  
       _deformable_detr_57c3d32::ms_deform_attn_forward         1.62%      32.200us        96.02%       1.913ms     637.550us      23.712us        89.38%      27.456us       9.152us             3  
void ms_deformable_im2col_gpu_kernel<float>(int, flo...         0.00%       0.000us         0.00%       0.000us       0.000us      23.712us        89.38%      23.712us       7.904us             3  
                                            aten::zeros         0.41%       8.111us        92.57%       1.844ms     614.623us       0.000us         0.00%       3.744us       1.248us             3  
                                            aten::zero_         0.44%       8.741us        91.34%       1.819ms     606.446us       0.000us         0.00%       3.744us       1.248us             3  
                                            aten::fill_         1.32%      26.360us        90.90%       1.811ms     603.533us       2.816us        10.62%       3.744us       1.248us             3  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.816us        10.62%       2.816us       0.939us             3  
                                Activity Buffer Request        88.30%       1.759ms        88.30%       1.759ms       1.759ms       0.928us         3.50%       0.928us       0.928us             1  
                                            aten::empty         0.82%      16.420us         0.82%      16.420us       5.473us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.00%      39.862us         2.00%      39.862us       6.644us       0.000us         0.00%       0.000us       0.000us             6  
                                             aten::view         0.45%       9.050us         0.45%       9.050us       1.508us       0.000us         0.00%       0.000us       0.000us             6  
                                           aten::select         0.54%      10.840us         0.66%      13.190us       4.397us       0.000us         0.00%       0.000us       0.000us             3  
                                       aten::as_strided         0.12%       2.350us         0.12%       2.350us       0.783us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.28%       5.611us         0.28%       5.611us       5.611us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.992ms
Self CUDA time total: 26.528us



======================================================================
PROFILE TRACE: hf_kernels_deformable_detr | cuda_B2_Q100_H8_E256_L4_P4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             hf_kernels_deformable_detr         0.00%       0.000us         0.00%       0.000us       0.000us     138.240us       537.98%     138.240us     138.240us             1  
                             hf_kernels_deformable_detr         3.56%      70.651us        99.71%       1.981ms       1.981ms       0.000us         0.00%      26.624us      26.624us             1  
       _deformable_detr_57c3d32::ms_deform_attn_forward         1.67%      33.240us        96.15%       1.910ms     636.753us      22.912us        89.17%      26.624us       8.875us             3  
void ms_deformable_im2col_gpu_kernel<float>(int, flo...         0.00%       0.000us         0.00%       0.000us       0.000us      22.912us        89.17%      22.912us       7.637us             3  
                                            aten::zeros         0.41%       8.110us        92.55%       1.839ms     612.899us       0.000us         0.00%       3.712us       1.237us             3  
                                            aten::zero_         0.40%       7.959us        91.32%       1.814ms     604.749us       0.000us         0.00%       3.712us       1.237us             3  
                                            aten::fill_         1.22%      24.170us        90.92%       1.806ms     602.096us       2.784us        10.83%       3.712us       1.237us             3  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.784us        10.83%       2.784us       0.928us             3  
                                Activity Buffer Request        88.35%       1.755ms        88.35%       1.755ms       1.755ms       0.928us         3.61%       0.928us       0.928us             1  
                                            aten::empty         0.82%      16.340us         0.82%      16.340us       5.447us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.09%      41.501us         2.09%      41.501us       6.917us       0.000us         0.00%       0.000us       0.000us             6  
                                             aten::view         0.44%       8.661us         0.44%       8.661us       1.444us       0.000us         0.00%       0.000us       0.000us             6  
                                           aten::select         0.62%      12.301us         0.75%      14.971us       4.990us       0.000us         0.00%       0.000us       0.000us             3  
                                       aten::as_strided         0.13%       2.670us         0.13%       2.670us       0.890us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.29%       5.820us         0.29%       5.820us       5.820us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.987ms
Self CUDA time total: 25.696us



======================================================================
PROFILE TRACE: hf_kernels_deformable_detr | cuda_B2_Q300_H8_E256_L4_P4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             hf_kernels_deformable_detr         0.00%       0.000us         0.00%       0.000us       0.000us     151.169us       321.37%     151.169us     151.169us             1  
                             hf_kernels_deformable_detr         3.15%      71.770us        99.78%       2.275ms       2.275ms       0.000us         0.00%      48.031us      48.031us             1  
       _deformable_detr_57c3d32::ms_deform_attn_forward         1.55%      35.341us        96.63%       2.204ms     734.529us      44.000us        93.54%      48.031us      16.010us             3  
void ms_deformable_im2col_gpu_kernel<float>(int, flo...         0.00%       0.000us         0.00%       0.000us       0.000us      44.000us        93.54%      44.000us      14.667us             3  
                                            aten::zeros         0.38%       8.571us        93.48%       2.132ms     710.555us       0.000us         0.00%       4.031us       1.344us             3  
                                            aten::zero_         0.42%       9.580us        92.38%       2.107ms     702.221us       0.000us         0.00%       4.031us       1.344us             3  
                                            aten::fill_         1.16%      26.560us        91.96%       2.097ms     699.028us       3.039us         6.46%       4.031us       1.344us             3  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       3.039us         6.46%       3.039us       1.013us             3  
                                Activity Buffer Request        80.85%       1.844ms        80.85%       1.844ms       1.844ms       0.992us         2.11%       0.992us       0.992us             1  
                                            aten::empty         0.72%      16.430us         0.72%      16.430us       5.477us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        10.56%     240.915us        10.56%     240.915us      40.153us       0.000us         0.00%       0.000us       0.000us             6  
                                             aten::view         0.41%       9.238us         0.41%       9.238us       1.540us       0.000us         0.00%       0.000us       0.000us             6  
                                           aten::select         0.48%      10.832us         0.58%      13.262us       4.421us       0.000us         0.00%       0.000us       0.000us             3  
                                       aten::as_strided         0.11%       2.430us         0.11%       2.430us       0.810us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.22%       4.990us         0.22%       4.990us       4.990us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.280ms
Self CUDA time total: 47.039us
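
The `Self CUDA %` values above 100% in the first row of each trace look odd, but they are consistent with the rest of the table. The figures suggest that "Self CUDA time total" is the sum of the two GPU kernels (the deformable im2col kernel and the zero-fill kernel). Every percentage appears to be taken relative to that sum, including the outer `hf_kernels_deformable_detr` range, which spans more wall time than the kernels inside it. The check below uses numbers copied from the first trace. It is an interpretation of the printed values, not a statement about the profiler's internals.

```python
# Figures copied from the cuda_B1_Q100_H8_E256_L4_P4 trace, in microseconds.
im2col_kernel_us = 22.496      # ms_deformable_im2col_gpu_kernel, 3 calls
fill_kernel_us = 2.848         # vectorized_elementwise_kernel (zero fill), 3 calls
annotation_range_us = 189.823  # outer hf_kernels_deformable_detr range
num_calls = 3

# Assumption: the reported total is the sum of the two kernels' self time.
self_cuda_total_us = im2col_kernel_us + fill_kernel_us


def pct(part_us, total_us):
    """Percentage of total, rounded to two decimals as in the trace."""
    return round(100.0 * part_us / total_us, 2)


print(round(self_cuda_total_us, 3))                  # 25.344, the reported total
print(pct(im2col_kernel_us, self_cuda_total_us))     # 88.76
print(pct(fill_kernel_us, self_cuda_total_us))       # 11.24
print(pct(annotation_range_us, self_cuda_total_us))  # 748.99
print(round(im2col_kernel_us / num_calls, 3))        # 7.499, per-call kernel time
```

The same arithmetic reproduces the 507.36%, 537.98%, and 321.37% entries in the other three traces.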


impl                        wl                          p50(ms)  ok
hf_kernels_deformable_detr  cuda_B1_Q100_H8_E256_L4_P4     0.03  True
hf_kernels_deformable_detr  cuda_B1_Q300_H8_E256_L4_P4     0.04  True
hf_kernels_deformable_detr  cuda_B2_Q100_H8_E256_L4_P4     0.04  True
hf_kernels_deformable_detr  cuda_B2_Q300_H8_E256_L4_P4     0.05  True
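
Workload names such as `cuda_B2_Q300_H8_E256_L4_P4` encode the problem size. The sketch below decodes them, assuming the letters follow the usual Deformable DETR convention: B = batch size, Q = queries, H = attention heads, E = embedding dimension, L = feature levels, P = sampling points per level. That mapping is not stated in the output above, and `parse_workload` and `sampling_points` are hypothetical helpers.

```python
import re


def parse_workload(name):
    """Split 'cuda_B2_Q300_...' into the device string and a {letter: int} dict."""
    device, *fields = name.split("_")
    dims = {}
    for field in fields:
        match = re.fullmatch(r"([A-Z])(\d+)", field)
        if match is None:
            raise ValueError(f"unexpected workload field: {field!r}")
        dims[match.group(1)] = int(match.group(2))
    return device, dims


def sampling_points(dims):
    """Total bilinear samples per forward pass under the assumed letter meanings."""
    return dims["B"] * dims["Q"] * dims["H"] * dims["L"] * dims["P"]


workloads = [
    "cuda_B1_Q100_H8_E256_L4_P4",
    "cuda_B1_Q300_H8_E256_L4_P4",
    "cuda_B2_Q100_H8_E256_L4_P4",
    "cuda_B2_Q300_H8_E256_L4_P4",
]
for name in workloads:
    device, dims = parse_workload(name)
    print(name, device, sampling_points(dims))
```

Under these assumptions the largest workload performs 6x as many samples as the smallest (76,800 vs 12,800). The average im2col kernel time in the traces only roughly doubles (14.667us vs 7.499us), which may indicate that the smaller workloads are dominated by fixed launch costs.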