Cell: setup | 191.24s

Code:
```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "accelerate>=1.10.1",
#     "torch>=2.7.0",
#     "kernels==0.10.0",
#     "transformers@https://github.com/huggingface/transformers.git",
#     "ipdb>=0.13.13",
#     "matplotlib>=3.7.2",
#     "numpy>=1.24.3",
# ]
# ///

import gc
import logging
import sys
import time

import torch
import torch.nn as nn
import torch.profiler
from kernels import replace_kernel_forward_from_hub, register_kernel_mapping, Mode, LayerRepository
from transformers import GptOssForCausalLM, PreTrainedTokenizerFast, Mxfp4Config
from transformers.models.gpt_oss.modeling_gpt_oss import GptOssMLP, GptOssRMSNorm

# Log at INFO level so the per-layer kernel-resolution messages are visible.
logging.basicConfig(level=logging.INFO)


def reset_peak_memory_stats():
    """Clear CUDA cache and reset memory allocation counters."""
    torch.cuda.empty_cache()
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    gc.collect()


def get_memory_stats():
    """Get current and peak CUDA memory usage."""
    if not torch.cuda.is_available():
        return {"allocated_gb": 0, "peak_gb": 0, "reserved_gb": 0}
    return {
        "allocated_gb": torch.cuda.memory_allocated() / 1e9,
        "peak_gb": torch.cuda.max_memory_allocated() / 1e9,
        "reserved_gb": torch.cuda.memory_reserved() / 1e9,
    }


def override_kernel_layer_name(cls_name: str, value) -> bool:
    """Helper to dynamically override the kernel_layer_name in a model class."""
    for mod in sys.modules.values():
        if mod is None:
            continue
        obj = getattr(mod, cls_name, None)
        if isinstance(obj, type) and issubclass(obj, nn.Module):
            setattr(obj, "kernel_layer_name", value)
            print(f"Overrode {cls_name}.kernel_layer_name to {value}")
            return True
    return False


# Init the model the normal way
model_id = "openai/gpt-oss-20b"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
quantization_config = Mxfp4Config(dequantize=True)

# Swap the MLP forward for the hub kernel; passing None for the RMSNorm
# keeps its original forward (and triggers a "no kernel mapping" warning
# at load time, visible in the logs below).
replace_kernel_forward_from_hub(GptOssMLP, "Yamoe")   # direct, type-safe
replace_kernel_forward_from_hub(GptOssRMSNorm, None)  # direct, type-safe

custom_mapping = {
    "Yamoe": {
        "cuda": {
            Mode.INFERENCE: LayerRepository(
                repo_id="drbh/yamoe",
                layer_name="Yamoe",
                revision="v0.3.0",
            )
        }
    }
}
register_kernel_mapping(custom_mapping)

model = GptOssForCausalLM.from_pretrained(
    model_id,
    dtype="bfloat16",
    device_map="auto",
    use_kernels=True,
    quantization_config=quantization_config,
).eval()

messages = [
    {"role": "system", "content": "What is Tensor Parallelism?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="low",
).to("cuda")

max_tokens = 512

with torch.inference_mode():
    start_time = time.perf_counter()
    generated = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=False,
        temperature=None,
    )
    end_time = time.perf_counter()

print(tokenizer.decode(generated[0], skip_special_tokens=False))
print(f"Generation took {end_time - start_time:.2f} seconds")
```
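Because the model loads with `use_kernels=True`, every `GptOssMLP` forward is served by the `Yamoe` hub kernel. To attribute the generation time below to the kernel rather than the model itself, one could load the same checkpoint without hub kernels and time the same prompt. This baseline pass is not part of the original script, just a minimal sketch reusing the objects defined above:

```python
# Hypothetical A/B baseline: same checkpoint, stock forwards, no hub kernels.
# (Needs enough free GPU memory for a second copy, or free `model` first.)
baseline = GptOssForCausalLM.from_pretrained(
    model_id,
    dtype="bfloat16",
    device_map="auto",
    use_kernels=False,  # keep the built-in GptOssMLP / GptOssRMSNorm forwards
    quantization_config=quantization_config,
).eval()

with torch.inference_mode():
    t0 = time.perf_counter()
    baseline.generate(**inputs, max_new_tokens=max_tokens,
                      do_sample=False, temperature=None)
print(f"Baseline generation took {time.perf_counter() - t0:.2f} seconds")
```

Output: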
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-22
Reasoning: low
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions
What is Tensor Parallelism?
<|end|><|start|>assistant<|channel|>analysis<|message|>We need to explain what Tensor Parallelism is. It's a concept in distributed training of large language models. It refers to splitting the weight matrices (tensors) across multiple devices, so each device holds a slice of the matrix. During forward/backward passes, each device computes partial results and then they are aggregated. It's used to scale up models beyond single device memory. Also mention pipeline parallelism, data parallelism. Provide details: e.g., for a linear layer weight matrix W of shape (out_features, in_features), we can split along out_features dimension across devices. Each device computes its part of the output. Then gather. Similarly for attention QKV projections. Provide example: GPT-3 uses tensor parallelism. Provide benefits: memory savings, compute scaling. Provide challenges: communication overhead, load balancing. Provide typical frameworks: Megatron-LM, DeepSpeed, etc. Provide mention of "tensor model parallelism" vs "tensor parallelism" synonyms. Provide explanation of how it's implemented: using torch.distributed, NCCL, etc. Provide mention of "tensor parallelism" in context of "DeepSpeed ZeRO Stage 3" or "Megatron-LM" etc. Provide mention of "tensor parallelism" as part of "model parallelism" in large language models. Provide mention of "tensor parallelism" vs "data parallelism" vs "pipeline parallelism". Provide mention of "tensor parallelism" as "splitting weight matrices across GPUs" and "communication of partial results".
Also mention "tensor parallelism" can be used for linear layers, self-attention, feed-forward networks. Provide example: In GPT, each layer has a linear projection for Q, K, V, and output. These can be split across GPUs. Provide mention of "tensor parallelism" as "model parallelism" that splits the model across GPUs. Provide mention of "tensor parallelism" as "splitting the weight matrices across GPUs" and "communication of partial results".
Also mention "tensor parallelism" can be used for "embedding tables" as well. Provide mention of "tensor parallelism" as "embedding table partitioning".
Also mention "tensor parallelism" can be used for "attention heads" splitting across GPUs.
Also mention "tensor parallelism" can be used for "feed-forward networks" splitting across GPUs.
Also mention "tensor parallelism" can be used for "softmax" and "layernorm" across GPUs.
Also mention "tensor parallelism" can be used for
Generation took 97.20 seconds
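The analysis channel above sketches the core mechanism: split a linear layer's weight W of shape (out_features, in_features) along the out_features dimension, let each device compute its slice of the output independently, then gather the partial results. A minimal single-process illustration of that column split (toy sizes, the two "devices" simulated as shards on one device):

```python
import torch

# Toy column-parallel linear layer, following the split described above:
# W has shape (out_features, in_features); shard it along out_features.
torch.manual_seed(0)
in_features, out_features = 8, 6
x = torch.randn(4, in_features)             # a batch of 4 activations
W = torch.randn(out_features, in_features)  # the full weight matrix

# Two "devices", each holding half of W's output rows.
W_shards = torch.chunk(W, 2, dim=0)

# Each device computes its slice of the output independently...
partials = [x @ w.T for w in W_shards]

# ...and a gather along the feature dim reassembles the full output.
y = torch.cat(partials, dim=1)
assert torch.allclose(y, x @ W.T)
print(y.shape)  # torch.Size([4, 6])
```

In a real multi-GPU setup each shard would live on its own device and the final concatenation would be an all-gather over torch.distributed (typically NCCL), which is the communication overhead the analysis mentions.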
Logs:
Fetching 3 files: 100%|██████████| 3/3 [00:17<00:00, 5.97s/it]
You are using full precision kernels, we will dequantize the model to bf16. To use the quantized model with quantization kernels, please set use_kernels=False
Loading checkpoint shards: 100%|██████████| 3/3 [00:08<00:00, 2.67s/it]
INFO:root:Using layer `Yamoe` from repo `drbh/yamoe` (revision: v0.3.0) for layer `Yamoe`
Fetching 6 files: 100%|██████████| 6/6 [00:00<00:00, 13.22it/s]
/tmp/uvnote-run-he2n6v96/home/.cache/uv/environments-v2/setup-30bb029f3f83f37d/lib/python3.12/site-packages/kernels/layer.py:868: UserWarning:
No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
warnings.warn(
[previous INFO line repeated 47 more times (48 `Yamoe` resolutions in total, one per layer pass), with the UserWarning above emitted once more in between]
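The repeated `INFO:root` lines come from the script's `logging.basicConfig(level=logging.INFO)` combined with one kernel resolution per layer. If that noise is unwanted on a future run, raising the root log level is enough; a one-line sketch:

```python
import logging

# In place of logging.basicConfig(level=logging.INFO) in the script above:
# keeps warnings and errors, drops the 48 per-layer INFO resolutions.
logging.basicConfig(level=logging.WARNING)
```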