Cell: setup | 191.24s

Code:
```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "accelerate>=1.10.1",
#     "torch>=2.7.0",
#     "kernels==0.10.0",
#     "transformers@https://github.com/huggingface/transformers.git",
#     "ipdb>=0.13.13",
#     "matplotlib>=3.7.2",
#     "numpy>=1.24.3",
# ]
# ///

import gc
import logging
import sys
import time

import torch
import torch.nn as nn
import torch.profiler
from kernels import replace_kernel_forward_from_hub, register_kernel_mapping, Mode, LayerRepository
from transformers import GptOssForCausalLM, PreTrainedTokenizerFast, Mxfp4Config
from transformers.models.gpt_oss.modeling_gpt_oss import GptOssMLP, GptOssRMSNorm

# Log at INFO level so the per-layer kernel-resolution messages are visible.
logging.basicConfig(level=logging.INFO)


def reset_peak_memory_stats():
    """Clear CUDA cache and reset memory allocation counters."""
    torch.cuda.empty_cache()
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    gc.collect()


def get_memory_stats():
    """Get current and peak CUDA memory usage."""
    if not torch.cuda.is_available():
        return {"allocated_gb": 0, "peak_gb": 0, "reserved_gb": 0}
    return {
        "allocated_gb": torch.cuda.memory_allocated() / 1e9,
        "peak_gb": torch.cuda.max_memory_allocated() / 1e9,
        "reserved_gb": torch.cuda.memory_reserved() / 1e9,
    }


def override_kernel_layer_name(cls_name: str, value) -> bool:
    """Helper to dynamically override the kernel_layer_name in a model class."""
    for mod in sys.modules.values():
        if mod is None:
            continue
        obj = getattr(mod, cls_name, None)
        if isinstance(obj, type) and issubclass(obj, nn.Module):
            setattr(obj, "kernel_layer_name", value)
            print(f"Overrode {cls_name}.kernel_layer_name to {value}")
            return True
    return False


# Init the model the normal way
model_id = "openai/gpt-oss-20b"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
quantization_config = Mxfp4Config(dequantize=True)

# Swap the MLP forward for the hub kernel; passing None for the RMSNorm
# keeps its original forward (and triggers a "no kernel mapping" warning
# at load time, visible in the logs below).
replace_kernel_forward_from_hub(GptOssMLP, "Yamoe")   # direct, type-safe
replace_kernel_forward_from_hub(GptOssRMSNorm, None)  # direct, type-safe

custom_mapping = {
    "Yamoe": {
        "cuda": {
            Mode.INFERENCE: LayerRepository(
                repo_id="drbh/yamoe",
                layer_name="Yamoe",
                revision="v0.3.0",
            )
        }
    }
}
register_kernel_mapping(custom_mapping)

model = GptOssForCausalLM.from_pretrained(
    model_id,
    dtype="bfloat16",
    device_map="auto",
    use_kernels=True,
    quantization_config=quantization_config,
).eval()

messages = [
    {"role": "system", "content": "What is Tensor Parallelism?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="low",
).to("cuda")

max_tokens = 512

with torch.inference_mode():
    start_time = time.perf_counter()
    generated = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=False,
        temperature=None,
    )
    end_time = time.perf_counter()

print(tokenizer.decode(generated[0], skip_special_tokens=False))
print(f"Generation took {end_time - start_time:.2f} seconds")
```
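Because the model loads with `use_kernels=True`, every `GptOssMLP` forward is served by the `Yamoe` hub kernel. To attribute the generation time below to the kernel rather than the model itself, one could load the same checkpoint without hub kernels and time the same prompt. This baseline pass is not part of the original script, just a minimal sketch reusing the objects defined above:

```python
# Hypothetical A/B baseline: same checkpoint, stock forwards, no hub kernels.
# (Needs enough free GPU memory for a second copy, or free `model` first.)
baseline = GptOssForCausalLM.from_pretrained(
    model_id,
    dtype="bfloat16",
    device_map="auto",
    use_kernels=False,  # keep the built-in GptOssMLP / GptOssRMSNorm forwards
    quantization_config=quantization_config,
).eval()

with torch.inference_mode():
    t0 = time.perf_counter()
    baseline.generate(**inputs, max_new_tokens=max_tokens,
                      do_sample=False, temperature=None)
print(f"Baseline generation took {time.perf_counter() - t0:.2f} seconds")
```

Output: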
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-22
Reasoning: low
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions
What is Tensor Parallelism?
<|end|><|start|>assistant<|channel|>analysis<|message|>We need to explain what Tensor Parallelism is. It's a concept in distributed training of large language models. It refers to splitting the weight matrices (tensors) across multiple devices, so each device holds a slice of the matrix. During forward/backward passes, each device computes partial results and then they are aggregated. It's used to scale up models beyond single device memory. Also mention pipeline parallelism, data parallelism. Provide details: e.g., for a linear layer weight matrix W of shape (out_features, in_features), we can split along out_features dimension across devices. Each device computes its part of the output. Then gather. Similarly for attention QKV projections. Provide example: GPT-3 uses tensor parallelism. Provide benefits: memory savings, compute scaling. Provide challenges: communication overhead, load balancing. Provide typical frameworks: Megatron-LM, DeepSpeed, etc. Provide mention of "tensor model parallelism" vs "tensor parallelism" synonyms. Provide explanation of how it's implemented: using torch.distributed, NCCL, etc. Provide mention of "tensor parallelism" in context of "DeepSpeed ZeRO Stage 3" or "Megatron-LM" etc. Provide mention of "tensor parallelism" as part of "model parallelism" in large language models. Provide mention of "tensor parallelism" vs "data parallelism" vs "pipeline parallelism". Provide mention of "tensor parallelism" as "splitting weight matrices across GPUs" and "communication of partial results".
Also mention "tensor parallelism" can be used for linear layers, self-attention, feed-forward networks. Provide example: In GPT, each layer has a linear projection for Q, K, V, and output. These can be split across GPUs. Provide mention of "tensor parallelism" as "model parallelism" that splits the model across GPUs. Provide mention of "tensor parallelism" as "splitting the weight matrices across GPUs" and "communication of partial results".
Also mention "tensor parallelism" can be used for "embedding tables" as well. Provide mention of "tensor parallelism" as "embedding table partitioning".
Also mention "tensor parallelism" can be used for "attention heads" splitting across GPUs.
Also mention "tensor parallelism" can be used for "feed-forward networks" splitting across GPUs.
Also mention "tensor parallelism" can be used for "softmax" and "layernorm" across GPUs.
Also mention "tensor parallelism" can be used for
Generation took 97.20 seconds
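The analysis channel above sketches the core mechanism: split a linear layer's weight W of shape (out_features, in_features) along the out_features dimension, let each device compute its slice of the output independently, then gather the partial results. A minimal single-process illustration of that column split (toy sizes, the two "devices" simulated as shards on one device):

```python
import torch

# Toy column-parallel linear layer, following the split described above:
# W has shape (out_features, in_features); shard it along out_features.
torch.manual_seed(0)
in_features, out_features = 8, 6
x = torch.randn(4, in_features)             # a batch of 4 activations
W = torch.randn(out_features, in_features)  # the full weight matrix

# Two "devices", each holding half of W's output rows.
W_shards = torch.chunk(W, 2, dim=0)

# Each device computes its slice of the output independently...
partials = [x @ w.T for w in W_shards]

# ...and a gather along the feature dim reassembles the full output.
y = torch.cat(partials, dim=1)
assert torch.allclose(y, x @ W.T)
print(y.shape)  # torch.Size([4, 6])
```

In a real multi-GPU setup each shard would live on its own device and the final concatenation would be an all-gather over torch.distributed (typically NCCL), which is the communication overhead the analysis mentions.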
Logs:
Fetching 3 files: 100%|██████████| 3/3 [00:17<00:00, 5.97s/it]
You are using full precision kernels, we will dequantize the model to bf16. To use the quantized model with quantization kernels, please set use_kernels=False
Loading checkpoint shards: 100%|██████████| 3/3 [00:08<00:00, 2.67s/it]
INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
Fetching 6 files: 100%|██████████| 6/6 [00:00<00:00, 13.22it/s]
/tmp/uvnote-run-he2n6v96/home/.cache/uv/environments-v2/setup-30bb029f3f83f37d/lib/python3.12/site-packages/kernels/layer.py:868: UserWarning:
No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
warnings.warn(
[previous INFO line repeated 47 more times (48 `Yamoe` resolutions in total, one per layer pass), with the UserWarning above emitted once more in between]
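The repeated `INFO:root` lines come from the script's `logging.basicConfig(level=logging.INFO)` combined with one kernel resolution per layer. If that noise is unwanted on a future run, raising the root log level is enough; a one-line sketch:

```python
import logging

# In place of logging.basicConfig(level=logging.INFO) in the script above:
# keeps warnings and errors, drops the 48 per-layer INFO resolutions.
logging.basicConfig(level=logging.WARNING)
```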