Cell: nv | 0.53s

import subprocess

print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)

Cell: setup | 133.12s

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "accelerate>=1.10.1",
#     "torch>=2.7.0",
#     "kernels==0.10.0",
#     "transformers@https://github.com/huggingface/transformers.git",
#     "ipdb>=0.13.13",
#     "matplotlib>=3.7.2",
#     "numpy>=1.24.3",
# ]
# ///

import torch
from transformers import GptOssForCausalLM, PreTrainedTokenizerFast, Mxfp4Config
import time
import torch.nn as nn
from kernels import register_kernel_mapping, Mode, LayerRepository
import sys
import torch.profiler
import gc
import logging

# Log at INFO level so the kernel-selection messages ("Using layer ...") are shown
logging.basicConfig(level=logging.INFO)

def reset_peak_memory_stats():
    """Clear CUDA cache and reset memory allocation counters."""
    # Collect garbage first so freed tensors can actually be released by empty_cache().
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()

def get_memory_stats():
    """Get current and peak CUDA memory usage."""
    if not torch.cuda.is_available():
        return {"allocated_gb": 0, "peak_gb": 0, "reserved_gb": 0}
    return {
        "allocated_gb": torch.cuda.memory_allocated() / 1e9,
        "peak_gb": torch.cuda.max_memory_allocated() / 1e9,
        "reserved_gb": torch.cuda.memory_reserved() / 1e9,
    }

def override_kernel_layer_name(cls_name: str, value) -> bool:
    """Helper to dynamically override the kernel_layer_name in a model class."""
    for mod in sys.modules.values():
        if mod is None:
            continue
        obj = getattr(mod, cls_name, None)
        if isinstance(obj, type) and issubclass(obj, nn.Module):
            setattr(obj, "kernel_layer_name", value)
            print(f"Overrode {cls_name}.kernel_layer_name to {value}")
            return True
    return False


# Init the model the normal way
model_id = "openai/gpt-oss-20b"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
quantization_config = Mxfp4Config(dequantize=True)


# register_kernel_mapping, LayerRepository and Mode are already imported above
from kernels import replace_kernel_forward_from_hub

from transformers.models.gpt_oss.modeling_gpt_oss import GptOssMLP, GptOssRMSNorm

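# Set kernel_layer_name on the model classes so `kernels` knows which mapping entry to look up.
replace_kernel_forward_from_hub(GptOssMLP, "Yamoe")  # use the "Yamoe" entry registered below
replace_kernel_forward_from_hub(GptOssRMSNorm, None)  # no mapping for None: keeps the original forward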
custom_mapping = {
    "Yamoe": {
        "cuda": {
            Mode.INFERENCE: LayerRepository(
                repo_id="drbh/yamoe",
                layer_name="Yamoe",
                revision="v0.3.0",
            )
        }
    }
}
register_kernel_mapping(custom_mapping)


model = GptOssForCausalLM.from_pretrained(
    model_id,
    dtype="bfloat16",
    device_map="auto",
    use_kernels=True,
    quantization_config=quantization_config,
).eval()

messages = [
    {"role": "system", "content": "What is Tensor Parallelism?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="low",
).to("cuda")

max_tokens = 512

with torch.inference_mode():
    start_time = time.perf_counter()
    generated = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=False,
        temperature=None,
    )
    end_time = time.perf_counter()

print(tokenizer.decode(generated[0], skip_special_tokens=False))
print(f"Generation took {end_time - start_time:.2f} seconds")

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI. Knowledge cutoff: 2024-06 Current date: 2025-09-23 Reasoning: low # Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions What is Tensor Parallelism? <|end|><|start|>assistant<|channel|>analysis<|message|>We need to explain what Tensor Parallelism is. It's a concept in distributed training of large language models. It refers to splitting the weight matrices (tensors) across multiple devices, so each device holds a slice of the matrix. During forward/backward passes, each device computes partial results and then they are aggregated. It's used to scale up models beyond single device memory. Also mention pipeline parallelism, data parallelism. Provide details: e.g., for a linear layer weight matrix W of shape (out_features, in_features), we can split along out_features dimension across devices. Each device computes its part of the output. Then gather results. In backward, gradients are computed locally and then aggregated. Provide example: GPT-3 training uses tensor parallelism. Also mention frameworks: Megatron-LM, DeepSpeed, etc. Provide pros/cons. Provide code snippet maybe. Also mention that it's different from data parallelism. Provide explanation of how it works in practice. Provide mention of communication overhead. Provide mention of "tensor model parallelism" vs "tensor parallelism" synonyms. Provide mention of "tensor parallelism" in context of huggingface accelerate. Provide mention of "tensor parallelism" in context of DeepSpeed ZeRO stage 3. Provide mention of "tensor parallelism" in context of Megatron-LM. Provide mention of "tensor parallelism" in context of GPT-NeoX. Provide mention of "tensor parallelism" in context of "DeepSpeed's ZeRO-Offload" maybe. Provide mention of "tensor parallelism" in context of "DeepSpeed's ZeRO-2" maybe. 
Provide mention of "tensor parallelism" in context of "DeepSpeed's ZeRO-3" maybe. Provide mention of "tensor parallelism" in context of "DeepSpeed's ZeRO-3" maybe. Provide mention of "tensor parallelism" in context of "DeepSpeed's ZeRO-3" maybe. Provide mention of "tensor parallelism" in context of "DeepSpeed's ZeRO-3" maybe. Provide mention of "tensor parallelism" in context of "DeepSpeed's ZeRO-3" maybe. Provide mention of "tensor parallelism" in context of "DeepSpeed's ZeRO-3" maybe. Provide mention of "tensor parallelism" in context of "DeepSpeed's ZeRO-3" maybe. Provide mention of "tensor parallelism" in context of "DeepSpeed's ZeRO-3" maybe. Provide mention of "tensor
Generation took 51.92 seconds
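The split described in the model's analysis text above (a weight matrix W of shape (out_features, in_features) sliced along out_features, each device computing its part of the output, results then gathered) can be illustrated with a small pure-Python sketch. The "devices" here are just list slices inside one process, and the function names and toy matrix are illustrative assumptions, not part of this notebook or of any library.

```python
def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def shard_rows(weight, num_shards):
    """Split the out_features (row) dimension into contiguous shards."""
    rows_per_shard, remainder = divmod(len(weight), num_shards)
    if remainder:
        raise ValueError("out_features must divide evenly across shards")
    return [
        weight[i * rows_per_shard:(i + 1) * rows_per_shard]
        for i in range(num_shards)
    ]


def tensor_parallel_matvec(weight, x, num_shards):
    """Each shard computes its slice of the output; slices are then gathered."""
    partial_outputs = [matvec(shard, x) for shard in shard_rows(weight, num_shards)]
    # Stand-in for an all-gather: concatenate the per-shard slices in shard order.
    return [y for part in partial_outputs for y in part]


# Toy layer with out_features=4 and in_features=3.
W = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
    [3.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],
]
x = [1.0, 2.0, 3.0]

full = matvec(W, x)
sharded = tensor_parallel_matvec(W, x, num_shards=2)
print(full == sharded)
```

Every shard needs the full input vector but only its own rows of W, which is why the weight memory per device shrinks while a gather step is needed at the end.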

▶ UV Install Logs

Downloading cpython-3.13.7-linux-x86_64-gnu (download) (32.0MiB) Downloading cpython-3.13.7-linux-x86_64-gnu (download) Updating https://github.com/huggingface/transformers.git (HEAD) Updated https://github.com/huggingface/transformers.git (99b0995138c17ef953959c70f35cb2bdc41111a2) Building transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2 Downloading nvidia-cuda-nvrtc-cu12 (84.0MiB) Downloading jedi (1.5MiB) Downloading nvidia-cusparse-cu12 (274.9MiB) Downloading hf-xet (3.0MiB) Downloading nvidia-cusparselt-cu12 (273.9MiB) Downloading nvidia-cuda-cupti-cu12 (9.8MiB) Downloading nvidia-curand-cu12 (60.7MiB) Downloading nvidia-nvjitlink-cu12 (37.4MiB) Downloading nvidia-cufft-cu12 (184.2MiB) Downloading nvidia-cusolver-cu12 (255.1MiB) Downloading pillow (6.3MiB) Downloading sympy (6.0MiB) Downloading fonttools (4.7MiB) Downloading numpy (15.9MiB) Downloading triton (148.4MiB) Downloading networkx (1.9MiB) Downloading tokenizers (3.1MiB) Downloading pygments (1.2MiB) Downloading matplotlib (8.3MiB) Downloading nvidia-cublas-cu12 (566.8MiB) Downloading nvidia-cufile-cu12 (1.1MiB) Downloading nvidia-cudnn-cu12 (674.0MiB) Downloading kiwisolver (1.4MiB) Downloading nvidia-nccl-cu12 (307.4MiB) Downloading torch (846.8MiB) Downloading nvidia-cufile-cu12 Downloading kiwisolver Downloading pygments Downloading hf-xet Downloading tokenizers Downloading networkx Downloading fonttools Downloading pillow Downloading matplotlib Downloading nvidia-cuda-cupti-cu12 Downloading numpy Downloading sympy Built transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2 Downloading nvidia-nvjitlink-cu12 Downloading jedi Downloading nvidia-curand-cu12 Downloading nvidia-cuda-nvrtc-cu12 Downloading triton Downloading nvidia-cufft-cu12 Downloading nvidia-cusolver-cu12 Downloading nvidia-cusparselt-cu12 Downloading nvidia-cusparse-cu12 Downloading nvidia-nccl-cu12 Downloading 
nvidia-cublas-cu12 Downloading nvidia-cudnn-cu12 Downloading torch Installed 69 packages in 467ms
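The script header of this cell leaves `transformers` unpinned, and the install log shows that it resolved to commit `99b0995138c17ef953959c70f35cb2bdc41111a2` for this run. A sketch of a header that pins that commit, so that a later run resolves the same source, might look like the following. The direct-reference form is the one uv printed in the log; the other entries are copied unchanged.

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "accelerate>=1.10.1",
#     "torch>=2.7.0",
#     "kernels==0.10.0",
#     "transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2",
#     "ipdb>=0.13.13",
#     "matplotlib>=3.7.2",
#     "numpy>=1.24.3",
# ]
# ///
```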

Fetching 3 files: 100%|██████████| 3/3 [00:08<00:00, 2.75s/it]
You are using full precision kernels, we will dequantize the model to bf16. To use the quantized model with quantization kernels, please set use_kernels=False
Loading checkpoint shards: 100%|██████████| 3/3 [00:05<00:00, 1.93s/it]
INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
Fetching 6 files: 100%|██████████| 6/6 [00:00<00:00, 17.67it/s]
/tmp/uvnote-run-hvgovjfd/home/.cache/uv/environments-v2/setup-443c07e337d3be43/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning: No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
warnings.warn(
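To confirm which kernel a run actually picked up, the `INFO:root:Using layer ...` messages in the log above can be tallied. The following is a minimal sketch using only the standard library. The regular expression is written against the message format as it appears in this log, and `summarize_kernel_log` is a hypothetical helper name, not part of `kernels`.

```python
import re
from collections import Counter

# Matches messages of the form seen in the log, e.g.
# INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
USING_LAYER = re.compile(
    r"Using layer `(?P<kernel>[^`]+)` from repo `(?P<repo>[^`]+)` "
    r"\(revision: (?P<revision>[^)]+)\) for layer `(?P<layer>[^`]+)`"
)


def summarize_kernel_log(log_text: str) -> Counter:
    """Count (kernel, repo, revision) triples reported in a log dump."""
    return Counter(
        (m["kernel"], m["repo"], m["revision"])
        for m in USING_LAYER.finditer(log_text)
    )


# Two messages copied from the log, joined by a space as in the raw dump.
sample = (
    "INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` "
    "(revision: v0.3.0) for layer `Yamoe` "
    "INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` "
    "(revision: v0.3.0) for layer `Yamoe`"
)
for (kernel, repo, revision), count in summarize_kernel_log(sample).items():
    print(f"{kernel} from {repo}@{revision}: {count} message(s)")
```

Running the same helper over the log of the reference cell would be expected to report `MegaBlocksMoeMLP` from `kernels-community/megablocks` instead, which is a quick way to check that the two cells really exercised different kernels.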

Reference kernel

Cell: setup2 | 139.97s

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "accelerate>=1.10.1",
#     "torch>=2.7.0",
#     "kernels==0.10.0",
#     "transformers@https://github.com/huggingface/transformers.git",
#     "ipdb>=0.13.13",
#     "matplotlib>=3.7.2",
#     "numpy>=1.24.3",
# ]
# ///

import torch
from transformers import GptOssForCausalLM, PreTrainedTokenizerFast, Mxfp4Config
import time
import torch.nn as nn
from kernels import register_kernel_mapping, Mode, LayerRepository
import sys
import torch.profiler
import gc
import logging

# Log at INFO level so the kernel-selection messages ("Using layer ...") are shown
logging.basicConfig(level=logging.INFO)

def reset_peak_memory_stats():
    """Clear CUDA cache and reset memory allocation counters."""
    # Collect garbage first so freed tensors can actually be released by empty_cache().
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()

def get_memory_stats():
    """Get current and peak CUDA memory usage."""
    if not torch.cuda.is_available():
        return {"allocated_gb": 0, "peak_gb": 0, "reserved_gb": 0}
    return {
        "allocated_gb": torch.cuda.memory_allocated() / 1e9,
        "peak_gb": torch.cuda.max_memory_allocated() / 1e9,
        "reserved_gb": torch.cuda.memory_reserved() / 1e9,
    }

def override_kernel_layer_name(cls_name: str, value) -> bool:
    """Helper to dynamically override the kernel_layer_name in a model class."""
    for mod in sys.modules.values():
        if mod is None:
            continue
        obj = getattr(mod, cls_name, None)
        if isinstance(obj, type) and issubclass(obj, nn.Module):
            setattr(obj, "kernel_layer_name", value)
            print(f"Overrode {cls_name}.kernel_layer_name to {value}")
            return True
    return False


# Init the model the normal way
model_id = "openai/gpt-oss-20b"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
quantization_config = Mxfp4Config(dequantize=True)


# register_kernel_mapping, LayerRepository and Mode are already imported above
from kernels import replace_kernel_forward_from_hub

from transformers.models.gpt_oss.modeling_gpt_oss import GptOssMLP, GptOssRMSNorm

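# Reference run: GptOssMLP is left untouched, so the "Yamoe" mapping registered below is
# not used and the MoE MLP resolves to MegaBlocksMoeMLP (see the logs).
replace_kernel_forward_from_hub(GptOssRMSNorm, None)  # no mapping for None: keeps the original forward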
custom_mapping = {
    "Yamoe": {
        "cuda": {
            Mode.INFERENCE: LayerRepository(
                repo_id="drbh/yamoe",
                layer_name="Yamoe",
                revision="v0.3.0",
            )
        }
    }
}
register_kernel_mapping(custom_mapping)


model = GptOssForCausalLM.from_pretrained(
    model_id,
    dtype="bfloat16",
    device_map="auto",
    use_kernels=True,
    quantization_config=quantization_config,
).eval()

messages = [
    {"role": "system", "content": "What is Tensor Parallelism?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="low",
).to("cuda")

max_tokens = 512

with torch.inference_mode():
    start_time = time.perf_counter()
    generated = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=False,
        temperature=None,
    )
    end_time = time.perf_counter()

print(tokenizer.decode(generated[0], skip_special_tokens=False))
print(f"Generation took {end_time - start_time:.2f} seconds")

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI. Knowledge cutoff: 2024-06 Current date: 2025-09-23 Reasoning: low # Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions What is Tensor Parallelism? <|end|><|start|>assistant<|channel|>analysis<|message|>We need to explain what Tensor Parallelism is. It's a concept in distributed training of large language models. It refers to splitting the weight matrices (tensors) across multiple devices. Provide details: how it works, benefits, challenges, typical frameworks, etc. Also mention difference from data parallelism, pipeline parallelism. Provide example: splitting a weight matrix across GPUs, each GPU holds a slice, compute partial results, then gather. Provide mention of communication overhead, scaling, etc. Also mention that it's used in large models like GPT-3, Megatron-LM, DeepSpeed, etc. Provide explanation of how it reduces memory usage, increases throughput. Provide mention of "tensor model parallelism" vs "tensor parallelism" synonyms. Provide mention of "tensor parallelism" in context of huggingface accelerate, DeepSpeed, Megatron. Provide mention of "tensor parallelism" in the context of "tensor parallelism" in the "DeepSpeed ZeRO-Offload" or "ZeRO-3" etc. Provide mention of "tensor parallelism" in the context of "tensor parallelism" in "DeepSpeed" and "Megatron-LM" and "DeepSpeed's ZeRO" and "DeepSpeed's ZeRO-3" and "DeepSpeed's ZeRO-2" etc. Provide mention of "tensor parallelism" in the context of "tensor parallelism" in "DeepSpeed's ZeRO-3" and "DeepSpeed's ZeRO-2" etc. Provide mention of "tensor parallelism" in the context of "tensor parallelism" in "DeepSpeed's ZeRO-3" and "DeepSpeed's ZeRO-2" etc. Provide mention of "tensor parallelism" in the context of "tensor parallelism" in "DeepSpeed's ZeRO-3" and "DeepSpeed's ZeRO-2" etc. 
Provide mention of "tensor parallelism" in the context of "tensor parallelism" in "DeepSpeed's ZeRO-3" and "DeepSpeed's ZeRO-2" etc. Provide mention of "tensor parallelism" in the context of "tensor parallelism" in "DeepSpeed's ZeRO-3" and "DeepSpeed's ZeRO-2" etc. Provide mention of "tensor parallelism" in the context of "tensor parallelism" in "DeepSpeed's ZeRO-3" and "DeepSpeed's ZeRO-2" etc. Provide mention of "tensor parallelism" in the
Generation took 57.98 seconds
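The two wall-clock timings can be put side by side. The sketch below assumes both runs produced the full `max_new_tokens=512` tokens (both outputs end mid-sentence, which is consistent with hitting the cap, but the notebook does not print a token count). Each timing is a single measurement, so the resulting numbers are indicative only.

```python
MAX_NEW_TOKENS = 512  # assumed number of generated tokens per run

# Wall-clock generation times printed by the two cells above.
timings_s = {
    "yamoe (drbh/yamoe v0.3.0)": 51.92,
    "reference (kernels-community/megablocks)": 57.98,
}

tokens_per_second = {
    name: MAX_NEW_TOKENS / seconds for name, seconds in timings_s.items()
}
for name, tps in tokens_per_second.items():
    print(f"{name}: {tps:.2f} tokens/s")

# Dict order is insertion order, so the first value is the Yamoe run.
yamoe_s, reference_s = timings_s.values()
speedup = reference_s / yamoe_s
print(f"Yamoe run finished {speedup:.2f}x faster than the reference run")
```

Neither cell performs a warm-up generation, so both figures may include one-time costs such as loading the kernel on first use. Repeating each measurement after a warm-up pass would give a fairer comparison.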

▶ UV Install Logs

Downloading cpython-3.13.7-linux-x86_64-gnu (download) (32.0MiB) Downloading cpython-3.13.7-linux-x86_64-gnu (download) Updating https://github.com/huggingface/transformers.git (HEAD) Updated https://github.com/huggingface/transformers.git (99b0995138c17ef953959c70f35cb2bdc41111a2) Downloading jedi (1.5MiB) Downloading pygments (1.2MiB) Downloading nvidia-cufile-cu12 (1.1MiB) Building transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2 Downloading hf-xet (3.0MiB) Downloading nvidia-cuda-cupti-cu12 (9.8MiB) Downloading numpy (15.9MiB) Downloading sympy (6.0MiB) Downloading matplotlib (8.3MiB) Downloading nvidia-cudnn-cu12 (674.0MiB) Downloading networkx (1.9MiB) Downloading nvidia-nvjitlink-cu12 (37.4MiB) Downloading pillow (6.3MiB) Downloading nvidia-cublas-cu12 (566.8MiB) Downloading tokenizers (3.1MiB) Downloading nvidia-cusolver-cu12 (255.1MiB) Downloading nvidia-cusparse-cu12 (274.9MiB) Downloading nvidia-curand-cu12 (60.7MiB) Downloading nvidia-cusparselt-cu12 (273.9MiB) Downloading nvidia-nccl-cu12 (307.4MiB) Downloading nvidia-cuda-nvrtc-cu12 (84.0MiB) Downloading fonttools (4.7MiB) Downloading nvidia-cufft-cu12 (184.2MiB) Downloading triton (148.4MiB) Downloading kiwisolver (1.4MiB) Downloading torch (846.8MiB) Downloading nvidia-cufile-cu12 Downloading kiwisolver Downloading pygments Downloading hf-xet Downloading tokenizers Downloading networkx Downloading fonttools Downloading pillow Downloading matplotlib Downloading nvidia-cuda-cupti-cu12 Downloading numpy Downloading sympy Built transformers @ git+https://github.com/huggingface/transformers.git@99b0995138c17ef953959c70f35cb2bdc41111a2 Downloading nvidia-nvjitlink-cu12 Downloading jedi Downloading nvidia-curand-cu12 Downloading nvidia-cuda-nvrtc-cu12 Downloading triton Downloading nvidia-cufft-cu12 Downloading nvidia-cusolver-cu12 Downloading nvidia-cusparselt-cu12 Downloading nvidia-cusparse-cu12 Downloading nvidia-nccl-cu12 Downloading 
nvidia-cublas-cu12 Downloading nvidia-cudnn-cu12 Downloading torch Installed 69 packages in 468ms

Fetching 3 files: 100%|██████████| 3/3 [00:08<00:00, 2.69s/it]
You are using full precision kernels, we will dequantize the model to bf16. To use the quantized model with quantization kernels, please set use_kernels=False
Loading checkpoint shards: 100%|██████████| 3/3 [00:05<00:00, 1.93s/it]
INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`
Fetching 66 files: 100%|██████████| 66/66 [00:01<00:00, 36.02it/s]
/tmp/uvnote-run-nw4e52ut/home/.cache/uv/environments-v2/setup2-69adf76231e4ab4f/lib/python3.13/site-packages/kernels/layer.py:868: UserWarning: No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
warnings.warn(