Loading weights from: /repo/moe_benchmarks/megablocks_yamoe/.uvnote/cache/f8744f31d9cf720409852d42748815c6d61f005a2a9b297b7b9bf986ed98bb90
Loaded shared weights from artifacts
Router weight sum: 12.588732
Gate/up sum: 1026.601807
Down sum: 206.729263
=== MegaBlocks Implementation ===
[MegaBlocks] Router weight sum: 12.588732
[MegaBlocks] Gate/up projection shape: (128, 1152, 2304), sum: 1026.601807
[MegaBlocks] Down projection shape: (128, 1152, 1152), sum: 206.729340
┌─ Benchmark Configuration ─────────────────────────────┐
│ Warmup: 10   Iters: 50                                │
│ Tokens: 100                                           │
│ Input Variation: Enabled (prevents caching artifacts) │
└───────────────────────────────────────────────────────┘
Base Input: shape=(1, 100, 1152), dtype=torch.float32, device=cuda:0, range=[-0.486445, 0.446746], mean=-0.000048, std=0.099986, norm=33.936142
Input Variation: +0.001 * iteration (deterministic)
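The "Input Variation" line above indicates the benchmark adds `0.001 * iteration` to the base input each pass, so no two iterations see identical data and cached results cannot inflate the timings. A minimal sketch of that scheme (function and parameter names are assumptions, not the benchmark's actual API; plain lists stand in for tensors):

```python
# Hedged sketch of deterministic input variation: offset the base input
# by a small, iteration-dependent constant. Deterministic means the same
# iteration index always reproduces the same perturbed input.
def vary_input(base, iteration, scale=1e-3):
    # Elementwise shift; on a real tensor this would be `base + scale * iteration`.
    return [v + scale * iteration for v in base]

base = [0.1, -0.2, 0.3]
assert vary_input(base, 0) == base              # iteration 0 is the unmodified input
shifted = vary_input(base, 3)
assert all(abs(s - b - 0.003) < 1e-12 for s, b in zip(shifted, base))
```

Because the perturbation is additive and tiny relative to the input range logged above ([-0.49, 0.45]), it defeats caching without meaningfully changing the routing or compute profile.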
Warming up (10 iterations)...
Downloading numpy (16.2MiB)
Downloading sympy (6.0MiB)
Downloading setuptools (1.1MiB)
Downloading nvidia-cudnn-cu12 (674.0MiB)
Downloading nvidia-curand-cu12 (60.7MiB)
Downloading nvidia-cuda-cupti-cu12 (9.8MiB)
Downloading hf-xet (3.0MiB)
Downloading nvidia-nvjitlink-cu12 (37.4MiB)
Downloading nvidia-cuda-nvrtc-cu12 (84.0MiB)
Downloading nvidia-cublas-cu12 (566.8MiB)
Downloading nvidia-cusolver-cu12 (255.1MiB)
Downloading nvidia-nccl-cu12 (307.4MiB)
Downloading nvidia-cufft-cu12 (184.2MiB)
Downloading triton (148.3MiB)
Downloading nvidia-cusparse-cu12 (274.9MiB)
Downloading nvidia-cusparselt-cu12 (273.9MiB)
Downloading nvidia-cufile-cu12 (1.1MiB)
Downloading torch (846.9MiB)
Downloading networkx (1.9MiB)
Installed 37 packages in 448ms
Fetching 66 files: 100%|██████████| 66/66 [00:02<00:00, 32.91it/s]
/tmp/tmp1397kafx/cuda_utils.c:5:10: fatal error: Python.h: No such file or directory
5 | #include <Python.h>
| ^~~~~~~~~~
compilation terminated.
Traceback (most recent call last):
File "/repo/moe_benchmarks/megablocks_yamoe/.uvnote/cells/megablocks_run.py", line 102, in <module>
output, stats = bench(model, x)
^^^^^^^^^^^^^^^
File "/repo/moe_benchmarks/megablocks_yamoe/.uvnote/cells/bench_utils.py", line 189, in runner
result, times_s = _bench_engine(call, warmup=warmup, iters=iters, device=device, dtype=dtype, input_gen=input_gen)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/repo/moe_benchmarks/megablocks_yamoe/.uvnote/cells/bench_utils.py", line 96, in _bench_engine
_ = call(input_gen())
^^^^^^^^^^^^^^^^^
File "/repo/moe_benchmarks/megablocks_yamoe/.uvnote/cells/bench_utils.py", line 177, in <lambda>
call = lambda x: fn(x, *args[1:], **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/repo/moe_benchmarks/megablocks_yamoe/.uvnote/cells/megablocks_run.py", line 81, in forward
output, dummy_routing_weights = self.model(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/huggingface/hub/models--kernels-community--megablocks/snapshots/e0fb1437de3f8d7079c4da13be8cb64dc0cfcdd5/build/torch28-cxx11-cu128-x86_64-linux/megablocks/layers.py", line 896, in forward
output, expert_weights_out, *_ = moe_forward(
^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/huggingface/hub/models--kernels-community--megablocks/snapshots/e0fb1437de3f8d7079c4da13be8cb64dc0cfcdd5/build/torch28-cxx11-cu128-x86_64-linux/megablocks/layers.py", line 730, in moe_forward
x, tokens_per_expert = forward_fn(**forward_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/huggingface/hub/models--kernels-community--megablocks/snapshots/e0fb1437de3f8d7079c4da13be8cb64dc0cfcdd5/build/torch28-cxx11-cu128-x86_64-linux/megablocks/layers.py", line 457, in forward_once
x = permute_and_compute(
^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/huggingface/hub/models--kernels-community--megablocks/snapshots/e0fb1437de3f8d7079c4da13be8cb64dc0cfcdd5/build/torch28-cxx11-cu128-x86_64-linux/megablocks/layers.py", line 401, in permute_and_compute
x = ops.binned_gather(x, indices, bins, expert_capacity, top_k)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/torch/autograd/function.py", line 576, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/huggingface/hub/models--kernels-community--megablocks/snapshots/e0fb1437de3f8d7079c4da13be8cb64dc0cfcdd5/build/torch28-cxx11-cu128-x86_64-linux/megablocks/ops/stk_autocast.py", line 30, in decorate_fwd
return fwd(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/huggingface/hub/models--kernels-community--megablocks/snapshots/e0fb1437de3f8d7079c4da13be8cb64dc0cfcdd5/build/torch28-cxx11-cu128-x86_64-linux/megablocks/ops/binned_gather.py", line 26, in forward
return kernels.binned_gather(x, indices, None, bins, bin_size, top_k)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/huggingface/hub/models--kernels-community--megablocks/snapshots/e0fb1437de3f8d7079c4da13be8cb64dc0cfcdd5/build/torch28-cxx11-cu128-x86_64-linux/megablocks/backend/kernels.py", line 419, in binned_gather
_binned_copy[(num_experts, expert_capacity)](
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/triton/runtime/jit.py", line 390, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 239, in run
benchmark()
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 228, in benchmark
timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 228, in <dictcomp>
timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 160, in _bench
return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
^^^^^^^^^^^^^
File "/usr/lib/python3.11/functools.py", line 1001, in __get__
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 121, in do_bench
return driver.active.get_benchmarker()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/triton/runtime/driver.py", line 30, in __getattr__
return getattr(self._initialize_obj(), name)
^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/triton/runtime/driver.py", line 26, in _initialize_obj
self._obj = self._init_fn()
^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/triton/runtime/driver.py", line 12, in _create_driver
return active_drivers[0]()
^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 715, in __init__
self.utils = CudaUtils() # TODO: make static
^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 62, in __init__
mod = compile_module_from_src(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/triton/runtime/build.py", line 88, in compile_module_from_src
so = _build(name, src_path, tmpdir, library_dirs or [], include_dirs or [], libraries or [])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/triton/runtime/build.py", line 51, in _build
subprocess.check_call(cc_cmd, stdout=subprocess.DEVNULL)
File "/usr/lib/python3.11/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp1397kafx/cuda_utils.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmp1397kafx/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib/x86_64-linux-gnu', '-I/tmp/uvnote-run-g9v2jr6r/home/.cache/uv/environments-v2/megablocks-run-8802ebf6d3566120/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmp1397kafx', '-I/usr/include/python3.11']' returned non-zero exit status 1.
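The root cause is at the top of the traceback, not the bottom: Triton JIT-compiles a small `cuda_utils` C extension with the system `gcc` the first time a kernel runs, and that compile needs the CPython development headers (`Python.h`), which this environment lacks. A hedged diagnostic/fix sketch (the apt package name is an assumption for Debian/Ubuntu matching the Python 3.11 seen in the log):

```shell
# Ask the running interpreter where it expects its C headers to live.
INC_DIR=$(python3 -c "import sysconfig; print(sysconfig.get_path('include'))")
echo "Python headers expected in: $INC_DIR"

# If Python.h is absent there, Triton's runtime build (and this benchmark)
# will fail exactly as above. Install the dev headers, then re-run.
# (Debian/Ubuntu package name assumed; match your interpreter version.)
if [ ! -f "$INC_DIR/Python.h" ]; then
    echo "Python.h missing - try: sudo apt-get install python3.11-dev"
fi
```

Note that this is an environment problem, so the MegaBlocks numbers above (weight sums, shapes) are unaffected; only the timed run never started.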