lccurious committed
Commit b359dde · 1 Parent(s): ecca2cf

First model version

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. README.md +139 -0
  2. config.json +55 -0
  3. configuration_llada2_moe.py +85 -0
  4. generation_config.json +7 -0
  5. model-00001-of-00042.safetensors +3 -0
  6. model-00002-of-00042.safetensors +3 -0
  7. model-00003-of-00042.safetensors +3 -0
  8. model-00004-of-00042.safetensors +3 -0
  9. model-00005-of-00042.safetensors +3 -0
  10. model-00006-of-00042.safetensors +3 -0
  11. model-00007-of-00042.safetensors +3 -0
  12. model-00008-of-00042.safetensors +3 -0
  13. model-00009-of-00042.safetensors +3 -0
  14. model-00010-of-00042.safetensors +3 -0
  15. model-00011-of-00042.safetensors +3 -0
  16. model-00012-of-00042.safetensors +3 -0
  17. model-00013-of-00042.safetensors +3 -0
  18. model-00014-of-00042.safetensors +3 -0
  19. model-00015-of-00042.safetensors +3 -0
  20. model-00016-of-00042.safetensors +3 -0
  21. model-00017-of-00042.safetensors +3 -0
  22. model-00018-of-00042.safetensors +3 -0
  23. model-00019-of-00042.safetensors +3 -0
  24. model-00020-of-00042.safetensors +3 -0
  25. model-00021-of-00042.safetensors +3 -0
  26. model-00022-of-00042.safetensors +3 -0
  27. model-00023-of-00042.safetensors +3 -0
  28. model-00024-of-00042.safetensors +3 -0
  29. model-00025-of-00042.safetensors +3 -0
  30. model-00026-of-00042.safetensors +3 -0
  31. model-00027-of-00042.safetensors +3 -0
  32. model-00028-of-00042.safetensors +3 -0
  33. model-00029-of-00042.safetensors +3 -0
  34. model-00030-of-00042.safetensors +3 -0
  35. model-00031-of-00042.safetensors +3 -0
  36. model-00032-of-00042.safetensors +3 -0
  37. model-00033-of-00042.safetensors +3 -0
  38. model-00034-of-00042.safetensors +3 -0
  39. model-00035-of-00042.safetensors +3 -0
  40. model-00036-of-00042.safetensors +3 -0
  41. model-00037-of-00042.safetensors +3 -0
  42. model-00038-of-00042.safetensors +3 -0
  43. model-00039-of-00042.safetensors +3 -0
  44. model-00040-of-00042.safetensors +3 -0
  45. model-00041-of-00042.safetensors +3 -0
  46. model-00042-of-00042.safetensors +3 -0
  47. model.safetensors.index.json +0 -0
  48. modeling_llada2_moe.py +1621 -0
  49. special_tokens_map.json +8 -0
  50. tokenizer.json +0 -0
README.md CHANGED
@@ -1,3 +1,142 @@
  ---
  license: apache-2.0
  ---
+ # LLaDA2.0-flash-preview
+ **LLaDA2.0-flash-preview** is a diffusion language model featuring a 100BA6B Mixture-of-Experts (MoE) architecture (100B total parameters, roughly 6B active). As an enhanced, instruction-tuned iteration of the LLaDA series, it is optimized for practical applications.
+
+ <div align="center">
+   <img src="https://mdn.alipayobjects.com/huamei_qa8qxu/afts/img/A*kLORSaRfSK8AAAAAgIAAAAgAemJ7AQ/original" width="800" />
+ </div>
+
+ ---
+
+ | Benchmark | Ling-flash-2.0 | LLaDA2.0-mini-preview | LLaDA2.0-flash-preview |
+ | :--- | :---: | :---: | :---: |
+ | **Average** | 79.93 | 66.59 | 77.03 |
+ | **Knowledge** | | | |
+ | MMLU | 87.98 | 72.49 | 83.15 |
+ | MMLU-PRO | 76.84 | 49.22 | 66.16 |
+ | CMMLU | 86.59 | 67.53 | 79.64 |
+ | C-EVAL | 88.03 | 66.54 | 79.28 |
+ | **Reasoning** | | | |
+ | squad2.0 | 81.32 | 85.61 | 90.61 |
+ | drop | 88.32 | 79.49 | 88.17 |
+ | korbench | 68.96 | 37.26 | 53.28 |
+ | **Coding** | | | |
+ | CruxEval-O | 82.75 | 61.88 | 74.50 |
+ | mbpp | 85.01 | 77.75 | 86.65 |
+ | MultiPL-E | 65.76 | 62.43 | 72.38 |
+ | humaneval | 85.98 | 80.49 | 88.41 |
+ | Bigcodebench-Full | 40.70 | 30.44 | 40.44 |
+ | **Math** | | | |
+ | GSM8K | 95.45 | 89.01 | 95.75 |
+ | math | 96.1 | 73.50 | 83.52 |
+ | **Agent & Alignment** | | | |
+ | BFCL_Live | 67.57 | 74.11 | 74.86 |
+ | IFEval-strict-prompt | 81.52 | 62.50 | 75.60 |
+
+
+
+ ## 🚀 Performance Highlights
+ + **Leading MoE Architecture**:
+   An open-source **Mixture-of-Experts (MoE) diffusion large language model**, pre-trained from scratch on approximately **20 trillion tokens**.
+ + **Efficient Inference**:
+   Of its **100 billion total parameters**, only **6.1 billion** are activated during inference. LLaDA2.0-flash-preview significantly reduces computational costs while outperforming open-source dense models of similar scale.
+ + **Impressive Performance on Code & Complex Reasoning**:
+   Excels in tasks such as **code generation** and **advanced mathematical reasoning**, demonstrating strong reasoning capabilities.
+ + **Tool Use**:
+   Supports **tool calling** and achieves excellent performance in complex agent-based tasks.
+ + **Open & Extensible**:
+   Fully open-source with a commitment to transparency. We plan to release a **leading inference framework** in the future and continue investing in cutting-edge areas like **diffusion LLMs (dLLM)** to drive disruptive innovation.
+
+ ## 🗺️ What's Next
+
+ + **Supercharged Reasoning with LLaDA 2.0:** The LLaDA 2.0 series will be fine-tuned with **Reinforcement Learning**, unlocking a new level of sophisticated reasoning and problem-solving abilities.
+ + **Tools for Innovators:** We will release a **detailed tutorial** and our complete **post-training framework**. Whether you want to master the current model or build your own customized versions, you'll have the tools you need. Stay tuned.
+
+ ---
+
+ ## 📦 Model Variants
+ | Model ID | Description | Hugging Face Link |
+ | --- | --- | --- |
+ | `inclusionAI/LLaDA2.0-mini-preview` | Instruction-tuned model, ready for downstream applications. | [🤗 Model Card](https://huggingface.co/inclusionAI/LLaDA2.0-mini-preview) |
+ | `inclusionAI/LLaDA2.0-flash-preview` | Instruction-tuned model, ready for downstream applications. | [🤗 Model Card](https://huggingface.co/inclusionAI/LLaDA2.0-flash-preview) |
+
+
+ ---
+
+ ## 🔍 Model Overview
+ **LLaDA2.0-flash-preview** has the following specifications:
+
+ + **Type**: Mixture-of-Experts (MoE) Diffusion Language Model
+ + **Total Parameters (Non-Embedding)**: 100B
+ + **Number of Layers**: 32
+ + **Attention Heads**: 32
+ + **Context Length**: 4,096 tokens
+ + **Position Embedding**: Rotary (RoPE)
+ + **Vocabulary Size**: 157,184
+
+ ---
+
+ ### 🤗 Hugging Face Transformers
+ Make sure you have `transformers` and its dependencies installed, then load the model and generate a response:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoModelForCausalLM
+ from transformers import AutoTokenizer
+
+ model_path = "/path/to/LLaDA2.0-flash-preview"
+ device = "auto"
+ # trust_remote_code is needed because the model class ships with the checkpoint.
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path, trust_remote_code=True, device_map=device
+ )
+ model = model.to(torch.bfloat16)
+ model.eval()
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+
+ prompt = "Why does Camus think that Sisyphus is happy?"
+ # Wrap the prompt in the chat template and tokenize it in one step.
+ input_ids = tokenizer.apply_chat_template(
+     [{"role": "user", "content": prompt}],
+     add_generation_prompt=True,
+     tokenize=True,
+     return_tensors="pt",
+ )
+ generated_tokens = model.generate(
+     inputs=input_ids,
+     eos_early_stop=True,
+     gen_length=512,
+     block_length=32,
+     steps=32,
+     temperature=0.0,
+ )
+ generated_answer = tokenizer.decode(
+     generated_tokens[0],
+     skip_special_tokens=True,
+ )
+ print(generated_answer)
+ ```
+
+ ### Best Practices
+ To achieve optimal performance, we recommend the following settings:
+
+ 1. **Sampling Parameters**:
+    We suggest using `temperature=0.0`, `block_length=32`, and `steps=32`. A higher temperature may occasionally result in language mixing and a slight decrease in model performance.
+
+ 2. **Adequate Output Length**:
+    We recommend an output length of 2048 tokens for most queries. For benchmarking on problems that require longer outputs, such as those found in math and programming competitions, we suggest setting the maximum output length to 4096 tokens.
+
+
+ ---
+
+ ## 🌐 License
+ This project is licensed under the terms of the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
+
+ ---
+
+ ## 🤝 Contact & Collaboration
+ For questions, collaborations, or feedback, please reach out via [Hugging Face](https://huggingface.co/inclusionAI/LLaDA2.0-mini-preview) or open an issue in the [repository](https://github.com/inclusionAI).
+
+ 👉 Join us in advancing open, efficient, and intelligent language models!
+
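The **Average** row in the README's benchmark table appears to be the unweighted mean of the 16 individual scores. A minimal Python check under that assumption, using the Ling-flash-2.0 and LLaDA2.0-flash-preview columns (the variable and function names are illustrative and not part of the repository):

```python
# Scores copied from the README benchmark table, in table order.
ling_flash_scores = [
    87.98, 76.84, 86.59, 88.03,         # Knowledge
    81.32, 88.32, 68.96,                # Reasoning
    82.75, 85.01, 65.76, 85.98, 40.70,  # Coding
    95.45, 96.1,                        # Math
    67.57, 81.52,                       # Agent & Alignment
]
llada_flash_scores = [
    83.15, 66.16, 79.64, 79.28,         # Knowledge
    90.61, 88.17, 53.28,                # Reasoning
    74.50, 86.65, 72.38, 88.41, 40.44,  # Coding
    95.75, 83.52,                       # Math
    74.86, 75.60,                       # Agent & Alignment
]


def unweighted_mean(scores):
    """Plain arithmetic mean; assumes every benchmark has equal weight."""
    return sum(scores) / len(scores)


ling_mean = unweighted_mean(ling_flash_scores)
llada_mean = unweighted_mean(llada_flash_scores)
print(f"Ling-flash-2.0: {ling_mean:.3f}, LLaDA2.0-flash-preview: {llada_mean:.3f}")
```

Both means land within rounding distance of the reported averages (79.93 and 77.03), which supports the equal-weight reading for these two columns.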
config.json ADDED
@@ -0,0 +1,55 @@
+ {
+   "architectures": [
+     "LLaDA2MoeModelLM"
+   ],
+   "attention_dropout": 0.0,
+   "auto_map": {
+     "AutoConfig": "configuration_llada2_moe.LLaDA2MoeConfig",
+     "AutoModel": "modeling_llada2_moe.LLaDA2MoeModel",
+     "AutoModelForCausalLM": "modeling_llada2_moe.LLaDA2MoeModelLM"
+   },
+   "num_hidden_layers": 32,
+   "hidden_size": 4096,
+   "intermediate_size": 9216,
+   "first_k_dense_replace": 1,
+   "hidden_act": "silu",
+   "max_position_embeddings": 16384,
+   "model_type": "llada2_moe",
+   "moe_intermediate_size": 1024,
+   "norm_topk_prob": true,
+   "num_experts_per_tok": 8,
+   "norm_head": false,
+   "num_attention_heads": 32,
+   "num_experts": 256,
+   "num_key_value_heads": 4,
+   "rope_theta": 600000,
+   "rope_scaling": null,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.52.3",
+   "use_bias": false,
+   "use_rmsnorm": true,
+   "rms_norm_eps": 1e-06,
+   "head_dim": 128,
+   "num_shared_experts": 1,
+   "use_cache": true,
+   "use_qkv_bias": false,
+   "embedding_dropout": 0.0,
+   "norm_softmax": false,
+   "output_dropout": 0.0,
+   "vocab_size": 157184,
+   "rotary_dim": 64,
+   "using_split_qkv_in_self_attention": false,
+   "router_dtype": "fp32",
+   "moe_router_enable_expert_bias": true,
+   "routed_scaling_factor": 2.5,
+   "n_group": 8,
+   "topk_group": 4,
+   "score_function": "sigmoid",
+   "initializer_range": 0.02,
+   "max_window_layers": 28,
+   "output_router_logits": false,
+   "pad_token_id": 156892,
+   "partial_rotary_factor": 0.5,
+   "use_sliding_window": false
+ }
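Several values in this `config.json` are related to one another. The sketch below derives a few of them, assuming conventional grouped-query attention semantics for `num_key_value_heads` and assuming `rotary_dim` equals `head_dim * partial_rotary_factor`. The part of the modeling code that would confirm either assumption is not visible in this view, and `derived_shapes` is a hypothetical helper, not a repository function.

```python
# Values copied from the config.json in this commit.
config = {
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 4,
    "head_dim": 128,
    "partial_rotary_factor": 0.5,
    "rotary_dim": 64,
    "num_experts": 256,
    "num_experts_per_tok": 8,
    "num_shared_experts": 1,
}


def derived_shapes(cfg):
    """Derive projection widths and routing ratios (illustrative helper)."""
    return {
        # Total width of all query heads and of all key/value heads.
        "q_width": cfg["num_attention_heads"] * cfg["head_dim"],
        "kv_width": cfg["num_key_value_heads"] * cfg["head_dim"],
        # Query heads sharing one key/value head under grouped-query attention.
        "queries_per_kv_head": cfg["num_attention_heads"] // cfg["num_key_value_heads"],
        # Assumed relation between head_dim and the rotary slice.
        "rotary_dim": int(cfg["head_dim"] * cfg["partial_rotary_factor"]),
        # Fraction of routed experts selected for each token.
        "routed_fraction": cfg["num_experts_per_tok"] / cfg["num_experts"],
    }


shapes = derived_shapes(config)
for name, value in shapes.items():
    print(name, value)
```

Under these assumptions the derived `rotary_dim` matches the value stored in the file, and each token is routed to 8 of 256 experts (about 3%) in addition to the single shared expert.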
configuration_llada2_moe.py ADDED
@@ -0,0 +1,85 @@
+ """LLaDA2 MoE model configuration"""
+
+ from transformers.configuration_utils import PretrainedConfig
+
+
+ class LLaDA2MoeConfig(PretrainedConfig):
+     model_type = "llada2_moe"
+
+     def __init__(
+         self,
+         vocab_size=30592,
+         hidden_size=1024,
+         intermediate_size=None,
+         num_hidden_layers=24,
+         num_attention_heads=16,
+         num_key_value_heads=0,
+         hidden_act="silu",
+         use_qkv_bias=False,  # llada2 only
+         use_qk_norm=False,
+         use_bias=True,  # llada2 only
+         rms_norm_eps=1e-05,
+         norm_head=False,  # llada2 only
+         tie_word_embeddings=False,  # PretrainedConfig key, here change default value.
+         embedding_dropout=0.1,
+         attention_dropout=0.1,
+         output_dropout=0.1,
+         initializer_range=0.02,
+         max_position_embeddings=16384,
+         rope_theta=10000.0,
+         use_cache=True,
+         use_sliding_window=False,
+         sliding_window=4096,
+         max_window_layers=28,
+         rope_scaling=None,
+         pad_token_id=126081,
+         num_experts=16,
+         num_shared_experts=0,
+         num_experts_per_tok=2,
+         n_group=8,
+         topk_group=4,
+         routed_scaling_factor=2.5,
+         moe_intermediate_size=None,
+         first_k_dense_replace=0,
+         head_dim=None,
+         output_router_logits=False,
+         partial_rotary_factor=0.5,
+         **kwargs,
+     ):
+         self.num_hidden_layers = num_hidden_layers
+         self.vocab_size = vocab_size
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_attention_heads = num_attention_heads
+         self.num_key_value_heads = num_key_value_heads
+         self.hidden_act = hidden_act
+         self.use_qkv_bias = use_qkv_bias
+         self.use_bias = use_bias
+         self.norm_head = norm_head
+         self.rms_norm_eps = rms_norm_eps
+         self.embedding_dropout = embedding_dropout
+         self.attention_dropout = attention_dropout
+         self.output_dropout = output_dropout
+         self.initializer_range = initializer_range
+         self.max_position_embeddings = max_position_embeddings
+         self.rope_theta = rope_theta
+         self.use_cache = use_cache
+         self.use_sliding_window = use_sliding_window
+         self.sliding_window = sliding_window
+         self.max_window_layers = max_window_layers
+         self.head_dim = head_dim or self.hidden_size // self.num_attention_heads
+         self.rope_scaling = rope_scaling
+
+         # MoE configs
+         self.num_experts = num_experts
+         self.num_shared_experts = num_shared_experts
+         self.num_experts_per_tok = num_experts_per_tok
+         self.n_group = n_group
+         self.topk_group = topk_group
+         self.moe_intermediate_size = moe_intermediate_size
+         self.first_k_dense_replace = first_k_dense_replace
+         self.output_router_logits = output_router_logits
+         self.routed_scaling_factor = routed_scaling_factor
+         self.partial_rotary_factor = partial_rotary_factor
+
+         super().__init__(pad_token_id=pad_token_id, tie_word_embeddings=tie_word_embeddings, **kwargs)
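The `head_dim` fallback in `LLaDA2MoeConfig.__init__` uses `or`, so any falsy value (`None` or `0`) triggers the `hidden_size // num_attention_heads` default. The standalone sketch below reproduces only that expression so it can run without `transformers`; `resolve_head_dim` is a hypothetical name introduced for illustration.

```python
from typing import Optional


def resolve_head_dim(
    hidden_size: int,
    num_attention_heads: int,
    head_dim: Optional[int] = None,
) -> int:
    """Mirror `head_dim or hidden_size // num_attention_heads` from the config class."""
    return head_dim or hidden_size // num_attention_heads


# Class defaults: hidden_size=1024, num_attention_heads=16, head_dim=None.
default_head_dim = resolve_head_dim(1024, 16)
# Values from the config.json in this commit, where head_dim is explicit.
flash_head_dim = resolve_head_dim(4096, 32, 128)
# A head_dim of 0 is falsy, so it is treated the same as "not set".
zero_head_dim = resolve_head_dim(4096, 32, 0)

print(default_head_dim, flash_head_dim, zero_head_dim)
```

For the shipped checkpoint the explicit `head_dim` of 128 happens to equal 4096 // 32, so the fallback and the stored value agree.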
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "eos_token_id": 156892,
+   "pad_token_id": 156892,
+   "transformers_version": "4.46.3",
+   "use_cache": false
+ }
model-00001-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cae1f1899b726a1efb16c72e33e4ae18d708fadeb681f4063a3e2728be55cc3c
+ size 4995489472
model-00002-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7d763bf5c5f0869a744dd5d50b24caf28a3dac62d76786b1a8cf3ba32f55bf28
+ size 4993392768
model-00003-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:abc81bc508ea43b4545bddcc9ea38df2d950386cf0812c67f45bea0a2ecf947d
+ size 4995490120
model-00004-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dd65defd15bcf04e7c8ce049e488253f403a78adaf1a5f806e3cb45d661298bc
+ size 4997586864
model-00005-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fd81380decf776ce43753d3d0c2ef4a6b0088790cdce4ef69cf9c195466e5583
+ size 4374723752
model-00006-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:328248165afb3c2eed928505cf40c47bf448b26e2865b82bd996b5c6ffea6f21
+ size 4993466536
model-00007-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:870af56ce1acaa352132d96d8cd910c818e1c786205f2725826ad3e5e7e0f5da
+ size 4993473216
model-00008-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e19aeb008c86f3ca031f425448c3c08da5286e790149bd750b2590b6490005cb
+ size 4995571368
model-00009-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:23b52735c20aad2fdb3b0866dec7910ab2abfb365e58021a5d0f240421e54c94
+ size 4995586784
model-00010-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1b32432d508676f0015ba0c7449869a559f08e694d01e3371bb35b732c4cd409
+ size 4997717864
model-00011-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3a774fedc9a6bf86cc4dc1a3389d67fcb723be335ac2397048ddc2c30ea71833
+ size 4999725680
model-00012-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:809d6188a4f18e527b5030ecdd410cdad5468529fafe246f8d98206155a7b984
+ size 4999684824
model-00013-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c95b5c9260346b29a3f239a308f6a919d141da9730f04a4c88835068d5cb4cdd
+ size 4999685416
model-00014-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:290eea128b4b50a7a3649f6627b7eaf607fff8709247431fcfb2edb533d83189
+ size 4999685056
model-00015-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c82137ad49baf58dcdb052de85023706f4d063839bcfee92d53b93732e564201
+ size 4999685104
model-00016-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ba801d0f9f98d750c65c0b54da3452c68d7625169796d6ef5dc1d4401e47577e
+ size 4995471368
model-00017-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b45b61ac9f24f0031afce46f7d71066de19131e049a893fcb56055e4722aa46b
+ size 4995489520
model-00018-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dd78deb2feaf63730dee7fff6ab29353737bd811fa09c36ed4ee2c529103af35
+ size 4995490040
model-00019-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9d8c7a6f3097ee803d243fb732628c910bb248ae0ebad149daf57e42d7c7bf56
+ size 4995490048
model-00020-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:46e92912762b9ce2fc56b89b35ff62338a374331a33013cd6ff39ebaf3c6bd9f
+ size 4995490064
model-00021-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:909ef15a6968063a3934c17c6198e52231e45363d320990cc22a5166ea44040c
+ size 4995489784
model-00022-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c0ab401babeaa57a055311db9d51d2e3a90c096661e1cb1a7b972a9088e78793
+ size 4999684496
model-00023-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9726a8222732dfc2feb660e971f95444a461b27689767517f2154aac38deac83
+ size 4999685328
model-00024-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:485656bd4f5707be353112b1ad35a72248aacb78aa62841eaaeee43bf278baa4
+ size 4999685304
model-00025-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d9d203d37002e40e79b00e6da3c4bdb46aaf1b9dab28bfabf1541a09746bb93e
+ size 4999685072
model-00026-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:818cdd523b14557924f51156ea734cc29e42b88301fcf889b107c39c0e393fd2
+ size 4999684920
model-00027-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d7e6e2e34db14b93e586c95f60ecd74ece24df100867f1775f3a49904b667521
+ size 4999684648
model-00028-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:13c2e60838c92b6d29c7e01f71abc70a90999c7e9c0b2452287e6878000c1078
+ size 4999685096
model-00029-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9076d46655ace4528e99990aca83e32d790af41562f0bfccdd9dfc67b8ebc3fe
+ size 4999685040
model-00030-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3c07c81bb0845211ccb2d9f508d56ac14243c2852cf7ff3cdd9d1aa7449e38ee
+ size 4999685096
model-00031-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:05fd4fc0138ce1029867441008058a8abfbaeea1fd115cf7f3d522c8f29a14fc
+ size 4999685064
model-00032-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5a4f4ff358c545a8167fa54449716e433988a66b7bc535511e9a86a340e87e9d
+ size 4999684432
model-00033-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4209ab0fe830fcc3fe331f2d7e557078032bf61bef647ef976f8c0f701f47a75
+ size 4999684672
model-00034-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fecaa2801cb2f5c30ce86270f6ad8997fd27464857827e057166fa6c860fc272
+ size 4999685096
model-00035-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:541f6907216cdf8bf51ff6a5a5e33813647c1d8e4c59943f930dd0747a7e99b6
+ size 4999685344
model-00036-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e48a7c7eaa34f3421240ba8dc58c7f9cc6cecd5753ecbd4f34e0a22af3584f23
+ size 4999685264
model-00037-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e2168633d36f4a9beff90356852bb37e5e2ac7c0c2735e9ca969f247a73a7b41
+ size 4999684592
model-00038-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f110b326fe5556b4b34bf0305c627d76f22379ef7aa7ea6510a799e7bc947174
+ size 4999684840
model-00039-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ad7e6a3d6b8a2f74e85507a28ae95a34ced16c5f44513d001eb03e1db1521420
+ size 4999685248
model-00040-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0566bb42ea7b8d61ce7c72cb53f4242f20869e56dd69c94cb7252b4218468317
+ size 4999685104
model-00041-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a6642d026b551d83b50f2b2feacdfeb228530b9203d217b723c791a6c5a3b0a0
+ size 4999685008
model-00042-of-00042.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:02ca31d066168d125f472f21b1c135989f2baac296ebea721d9adf7c7c04a410
+ size 1484847944
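Each `.safetensors` entry above is a Git LFS pointer file rather than the weights themselves: three `key value` lines giving the spec version, the SHA-256 object ID, and the size in bytes. A minimal parsing sketch, shown on the pointer for the last shard (`parse_lfs_pointer` is a hypothetical helper, not part of any library):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, hash, and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from model-00042-of-00042.safetensors above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:02ca31d066168d125f472f21b1c135989f2baac296ebea721d9adf7c7c04a410
size 1484847944
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```

Applying the same function to all 42 pointers and summing `size` would give the total download size of the checkpoint.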
model.safetensors.index.json ADDED
The diff for this file is too large to render.
modeling_llada2_moe.py ADDED
@@ -0,0 +1,1621 @@
+ # coding=utf-8
+ # Copyright 2025 Antgroup and The HuggingFace Inc. team. All rights reserved.
+ #
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+ # and OPT implementations in this library. It has been modified from its
+ # original forms to accommodate minor architectural differences compared
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """PyTorch LLaDA2MoE model."""
+
+ import math
+ import warnings
+ from typing import List, Optional, Tuple, Union
+
+ import torch
+ import torch.nn.functional as F
+ import torch.utils.checkpoint
+ from torch import nn
+ from torch.nn import CrossEntropyLoss
+
+ from transformers.activations import ACT2FN
+ from transformers.cache_utils import Cache, DynamicCache
+ from transformers.modeling_attn_mask_utils import (
+     AttentionMaskConverter,
+     _prepare_4d_attention_mask,
+     _prepare_4d_causal_attention_mask,
+     _prepare_4d_causal_attention_mask_for_sdpa,
+ )
+ from transformers.modeling_outputs import (
+     MoeModelOutputWithPast,
+     MoeCausalLMOutputWithPast,
+ )
+ from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
+ from transformers.modeling_utils import PreTrainedModel
+ from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS, is_torch_greater_or_equal_than_1_13
+ from transformers.utils import (
+     add_start_docstrings,
+     add_start_docstrings_to_model_forward,
+     is_flash_attn_2_available,
+     is_flash_attn_greater_or_equal_2_10,
+     logging,
+     replace_return_docstrings,
+ )
+ from transformers.utils.import_utils import is_torch_fx_available
+ from .configuration_llada2_moe import LLaDA2MoeConfig
+ from transformers.generation.utils import GenerationMixin
+
+
+ if is_flash_attn_2_available():
+     from flash_attn import flash_attn_func, flash_attn_varlen_func
+     from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input  # noqa
+
+
+ # This makes `_prepare_4d_causal_attention_mask` a leaf function in the FX graph.
+ # It means that the function will not be traced through and simply appear as a node in the graph.
+ if is_torch_fx_available():
+     if not is_torch_greater_or_equal_than_1_13:
+         import torch.fx
+
+     _prepare_4d_causal_attention_mask = torch.fx.wrap(_prepare_4d_causal_attention_mask)
+
+
+ logger = logging.get_logger(__name__)
+
+ _CONFIG_FOR_DOC = "LLaDA2MoeConfig"
+
+
+ def _get_unpad_data(attention_mask):
+     # Number of non-padding tokens in each sequence of the batch.
+     seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
+     # Flat positions of the non-padding tokens.
+     indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+     max_seqlen_in_batch = seqlens_in_batch.max().item()
+     # Cumulative sequence lengths with a leading zero.
+     cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
+     return (
+         indices,
+         cu_seqlens,
+         max_seqlen_in_batch,
+     )
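To make the three return values of `_get_unpad_data` concrete, here is a pure-Python sketch of the same bookkeeping on nested lists instead of tensors. It assumes a 0/1 padding mask with one row per sequence; `get_unpad_data` is an illustrative stand-in, not the function from the file.

```python
from itertools import accumulate


def get_unpad_data(attention_mask):
    """List-based analogue of `_get_unpad_data` for a 0/1 padding mask."""
    # Real (non-padding) token count per sequence.
    seqlens = [sum(row) for row in attention_mask]
    # Positions of real tokens in the flattened batch.
    flat = [value for row in attention_mask for value in row]
    indices = [i for i, value in enumerate(flat) if value]
    # Cumulative lengths with a leading 0, marking sequence boundaries.
    cu_seqlens = [0] + list(accumulate(seqlens))
    return indices, cu_seqlens, max(seqlens)


# Two sequences padded to length 4: the first has 3 real tokens, the second 2.
mask = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
]
indices, cu_seqlens, max_seqlen = get_unpad_data(mask)
print(indices, cu_seqlens, max_seqlen)
```

Sequence `i` occupies the slice `cu_seqlens[i]:cu_seqlens[i + 1]` of the packed token list, which is the layout that variable-length attention kernels typically expect.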
+
+
+ class LLaDA2MoeRMSNorm(nn.Module):
+     def __init__(self, hidden_size, eps=1e-6):
+         """
+         LLaDA2MoeRMSNorm is equivalent to T5LayerNorm
+         """
+         super().__init__()
+         self.weight = nn.Parameter(torch.ones(hidden_size))
+         self.variance_epsilon = eps
+
+     def forward(self, hidden_states):
+         input_dtype = hidden_states.dtype
+         hidden_states = hidden_states.to(torch.float32)
+         variance = hidden_states.pow(2).mean(-1, keepdim=True)
+         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+         return self.weight * hidden_states.to(input_dtype)
+
+
+ ALL_LAYERNORM_LAYERS.append(LLaDA2MoeRMSNorm)
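The `forward` of `LLaDA2MoeRMSNorm` rescales each vector by the reciprocal of its root mean square and then applies a learned per-channel weight. The sketch below restates that arithmetic for a single vector in plain Python, so it can be checked without PyTorch; `rms_norm` is an illustrative name and the float32 upcast of the original is omitted.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMS-normalize one vector, then scale each channel by its weight."""
    # Mean of squares over the last (here: only) dimension.
    variance = sum(value * value for value in x) / len(x)
    # Equivalent of torch.rsqrt(variance + eps).
    scale = 1.0 / math.sqrt(variance + eps)
    return [w * value * scale for value, w in zip(x, weight)]


normalized = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
mean_square = sum(value * value for value in normalized) / len(normalized)
print(normalized, mean_square)
```

With unit weights the output has a mean square very close to 1, and unlike LayerNorm no mean is subtracted.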
+
+
+ class LLaDA2MoeRotaryEmbedding(nn.Module):
+     def __init__(self, config: LLaDA2MoeConfig, device=None):
+         super().__init__()
+         # BC: "rope_type" was originally "type"
+         if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
+             self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
+         else:
+             self.rope_type = "default"
+         self.max_seq_len_cached = config.max_position_embeddings
+         self.original_max_seq_len = config.max_position_embeddings
+
+         self.config = config
+         self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
+
+         inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
+         self.register_buffer("inv_freq", inv_freq, persistent=False)
+         self.original_inv_freq = self.inv_freq
+
+     @torch.no_grad()
+     @dynamic_rope_update  # power user: used with advanced RoPE types (e.g. dynamic rope)
+     def forward(self, x, position_ids):
+         inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
+         position_ids_expanded = position_ids[:, None, :].float()
+
+         device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"
+         with torch.autocast(device_type=device_type, enabled=False):  # Force float32
+             freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+             emb = torch.cat((freqs, freqs), dim=-1)
+             cos = emb.cos() * self.attention_scaling
+             sin = emb.sin() * self.attention_scaling
+
+         return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+
+
+ # Copied from transformers.models.llama.modeling_llama.rotate_half
+ def rotate_half(x):
+     """Rotates half the hidden dims of the input."""
+     x1 = x[..., : x.shape[-1] // 2]
+     x2 = x[..., x.shape[-1] // 2 :]
+     return torch.cat((-x2, x1), dim=-1)
151
+
152
+
153
+ # Copied from transformers.models.llama.modeling_llama.apply_rotary_pos_emb
154
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
155
+ """Applies Rotary Position Embedding to the query and key tensors.
156
+
157
+ Args:
158
+ q (`torch.Tensor`): The query tensor.
159
+ k (`torch.Tensor`): The key tensor.
160
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
161
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
162
+ position_ids (`torch.Tensor`):
163
+ The position indices of the tokens corresponding to the query and key tensors. For example, this can be
164
+ used to pass offset position ids when working with a KV-cache.
165
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
166
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
167
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
168
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
169
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
170
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
171
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
172
+ Returns:
173
+ `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
174
+ """
175
+ cos = cos.unsqueeze(unsqueeze_dim)
176
+ sin = sin.unsqueeze(unsqueeze_dim)
177
+
178
+ # Keep half or full tensor for later concatenation
179
+ rotary_dim = cos.shape[-1]
180
+ q_rot, q_pass = q[..., :rotary_dim], q[..., rotary_dim:]
181
+ k_rot, k_pass = k[..., :rotary_dim], k[..., rotary_dim:]
182
+
183
+ # Apply rotary embeddings on the first half or full tensor
184
+ q_embed = (q_rot * cos) + (rotate_half(q_rot) * sin)
185
+ k_embed = (k_rot * cos) + (rotate_half(k_rot) * sin)
186
+
187
+ # Concatenate back to full shape
188
+ q_embed = torch.cat([q_embed, q_pass], dim=-1)
189
+ k_embed = torch.cat([k_embed, k_pass], dim=-1)
190
+ return q_embed, k_embed
191
+
192
+
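The interplay of `rotate_half` and `apply_rotary_pos_emb` can be checked on a single vector. This is a plain-Python sketch under stated assumptions: the helper `apply_rope`, the `base=10000.0` default (standing in for `config.rope_theta`), and the sample vectors are hypothetical, and `attention_scaling` is taken to be 1.

```python
import math


def rotate_half(x):
    """List version of rotate_half: (x1, x2) -> (-x2, x1)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rope(x, position, base=10000.0):
    """Hypothetical helper: rotate one head vector by position-dependent angles."""
    dim = len(x)
    inv_freq = [base ** (-2.0 * i / dim) for i in range(dim // 2)]
    # Duplicating the angles mirrors torch.cat((freqs, freqs), dim=-1).
    angles = [position * f for f in inv_freq] * 2
    rotated = rotate_half(x)
    return [v * math.cos(a) + r * math.sin(a) for v, r, a in zip(x, rotated, angles)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# The attention score depends only on the relative offset (here 2) between positions.
score_a = dot(apply_rope(q, 5), apply_rope(k, 3))
score_b = dot(apply_rope(q, 2), apply_rope(k, 0))
print(score_a, score_b)
```

Because each pair `(x[i], x[i + dim/2])` is rotated by the same angle, vector norms are preserved and the query-key dot product depends only on the position difference.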
193
+ class LLaDA2MoeMLP(nn.Module):
194
+ def __init__(self, config: LLaDA2MoeConfig, intermediate_size: int):
195
+ super().__init__()
196
+ self.config = config
197
+ self.hidden_size = config.hidden_size
198
+ self.intermediate_size = intermediate_size
199
+
200
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
201
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
202
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
203
+ self.act_fn = ACT2FN[config.hidden_act]
204
+
205
+ def forward(self, x):
206
+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
207
+
208
+
209
+ class LLaDA2MoeGate(nn.Module):
210
+ def __init__(self, config):
211
+ super().__init__()
212
+ self.config = config
213
+ self.top_k = config.num_experts_per_tok
214
+ self.num_experts = config.num_experts
215
+
216
+ self.n_group = config.n_group
217
+ self.topk_group = config.topk_group
218
+
219
+ # topk selection algorithm
220
+ self.gating_dim = config.hidden_size
221
+ self.weight = nn.Parameter(torch.empty((self.num_experts, self.gating_dim)))
222
+ self.routed_scaling_factor = config.routed_scaling_factor
223
+
224
+ self.register_buffer("expert_bias", torch.zeros((self.num_experts)))
225
+ self.reset_parameters()
226
+
227
+ def reset_parameters(self) -> None:
228
+ import torch.nn.init as init
229
+
230
+ init.kaiming_uniform_(self.weight, a=math.sqrt(5))
231
+
232
+ def group_limited_topk(
233
+ self,
234
+ scores: torch.Tensor,
235
+ ):
236
+ num_tokens, _ = scores.size()
237
+ # Organize the experts into groups and score each group by the sum of its top-2 expert scores
238
+ group_scores = scores.view(num_tokens, self.n_group, -1).topk(2, dim=-1)[0].sum(dim=-1)
239
+ group_idx = torch.topk(group_scores, k=self.topk_group, dim=-1, sorted=False)[1]
240
+ group_mask = torch.zeros_like(group_scores)
241
+ group_mask.scatter_(1, group_idx, 1)
242
+
243
+ # Mask the experts based on selection groups
244
+ score_mask = (
245
+ group_mask.unsqueeze(-1)
246
+ .expand(num_tokens, self.n_group, self.num_experts // self.n_group)
247
+ .reshape(num_tokens, -1)
248
+ )
249
+
250
+ masked_scores = scores.masked_fill(~score_mask.bool(), float('-inf'))
251
+ probs, top_indices = torch.topk(masked_scores, k=self.top_k, dim=-1)
252
+
253
+ return probs, top_indices
254
+
255
+ def forward(self, hidden_states):
256
+ # compute gating score
257
+ hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
258
+ logits = F.linear(hidden_states.type(torch.float32), self.weight.type(torch.float32))
259
+
260
+ scores = torch.sigmoid(logits.float()).type_as(logits)
261
+
262
+ scores_for_routing = scores + self.expert_bias
263
+ _, topk_idx = self.group_limited_topk(scores_for_routing)
264
+
265
+ scores = torch.gather(scores, dim=1, index=topk_idx).type_as(logits)
266
+
267
+ topk_weight = scores / (scores.sum(dim=-1, keepdim=True) + 1e-20) if self.top_k > 1 else scores
268
+ topk_weight = topk_weight * self.routed_scaling_factor
269
+
270
+ return topk_idx, topk_weight, logits
271
+
272
+
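The group-limited routing in `LLaDA2MoeGate` can be traced by hand on a small example. This is a plain-Python sketch for a single token, not the module's tensor code; the function names, the sample scores, and the scaling value 2.5 are hypothetical, and the `expert_bias` added before routing is assumed to be zero.

```python
def group_limited_topk(scores, n_group, topk_group, top_k):
    """Hypothetical helper: pick top_k experts, restricted to the best topk_group groups."""
    group_size = len(scores) // n_group
    groups = [scores[g * group_size:(g + 1) * group_size] for g in range(n_group)]
    # Each group is scored by the sum of its two highest expert scores.
    group_scores = [sum(sorted(g, reverse=True)[:2]) for g in groups]
    kept = set(sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group])
    # Experts outside the kept groups are excluded (the -inf mask in the tensor code).
    candidates = [i for i in range(len(scores)) if i // group_size in kept]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:top_k]


def normalize_weights(selected_scores, routed_scaling_factor=1.0):
    """Renormalize the selected scores to sum to 1, then apply the scaling factor."""
    total = sum(selected_scores) + 1e-20
    return [routed_scaling_factor * s / total for s in selected_scores]


scores = [0.9, 0.1, 0.2, 0.3, 0.5, 0.6, 0.05, 0.8]  # sigmoid outputs for 8 experts
chosen = group_limited_topk(scores, n_group=4, topk_group=2, top_k=2)
weights = normalize_weights([scores[i] for i in chosen], routed_scaling_factor=2.5)
print(chosen, weights)
```

Note that expert 7 has the second-highest individual score (0.8) but is not chosen, because its group's top-2 sum loses to the groups containing experts 0 and 5.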
273
+ class LLaDA2MoeSparseMoeBlock(nn.Module):
274
+ """
275
+ A mixture-of-experts (MoE) block with routed experts and optional shared experts.
276
+ """
277
+
278
+ def __init__(self, config: LLaDA2MoeConfig):
279
+ super().__init__()
280
+ self.config = config
281
+ self.num_experts_per_tok = config.num_experts_per_tok
282
+ self._setup_experts()
283
+ self.gate = LLaDA2MoeGate(config)
284
+ if config.num_shared_experts is not None:
285
+ self.shared_experts = LLaDA2MoeMLP(
286
+ config=config, intermediate_size=config.moe_intermediate_size * config.num_shared_experts
287
+ )
288
+
289
+ def _setup_experts(self):
290
+ self.experts = nn.ModuleList(
291
+ [
292
+ LLaDA2MoeMLP(config=self.config, intermediate_size=self.config.moe_intermediate_size)
293
+ for _ in range(self.config.num_experts)
294
+ ]
295
+ )
296
+
297
+ def forward(self, hidden_states):
298
+ identity = hidden_states
299
+ bsz, seq_len, h = hidden_states.shape
300
+ topk_idx, topk_weight, router_logits = self.gate(hidden_states)
301
+ hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
302
+ flat_topk_idx = topk_idx.view(-1)
303
+ if self.training:
304
+ hidden_states = hidden_states.repeat_interleave(self.num_experts_per_tok, dim=0)
305
+ y = torch.empty_like(hidden_states)
306
+ for i, expert in enumerate(self.experts):
307
+ y[flat_topk_idx == i] = expert(hidden_states[flat_topk_idx == i])
308
+ y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim=1)
309
+ y = y.to(hidden_states.dtype).view(bsz, seq_len, h)
310
+ else:
311
+ y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(bsz, seq_len, h)
312
+ if self.config.num_shared_experts is not None:
313
+ y = y + self.shared_experts(identity)
314
+ return y, (router_logits.view(bsz, seq_len, -1), topk_idx.view(bsz, seq_len, -1))
315
+
316
+ @torch.no_grad()
317
+ def moe_infer(self, x, topk_ids, topk_weight):
318
+ cnts = topk_ids.new_zeros((topk_ids.shape[0], len(self.experts)))
319
+ cnts.scatter_(1, topk_ids, 1)
320
+ tokens_per_expert = cnts.sum(dim=0)
321
+ idxs = topk_ids.view(-1).argsort()
322
+ sorted_tokens = x[idxs // topk_ids.shape[1]]
323
+ sorted_tokens_shape = sorted_tokens.shape
324
+ tokens_per_expert = tokens_per_expert.cpu().numpy()
325
+ outputs = []
326
+ start_idx = 0
327
+ for i, num_tokens in enumerate(tokens_per_expert):
328
+ end_idx = start_idx + num_tokens
329
+ if num_tokens == 0:
330
+ continue
331
+ expert = self.experts[i]
332
+ tokens_for_this_expert = sorted_tokens[start_idx:end_idx]
333
+ expert_out = expert(tokens_for_this_expert)
334
+ outputs.append(expert_out.to(x.device))
335
+ start_idx = end_idx
336
+
337
+ outs = torch.cat(outputs, dim=0) if len(outputs) else sorted_tokens.new_empty(0)
338
+ new_x = torch.empty_like(outs)
339
+ new_x[idxs] = outs
340
+ final_out = (
341
+ new_x.view(*topk_ids.shape, -1)
342
+ .type(topk_weight.dtype)
343
+ .mul_(topk_weight.unsqueeze(dim=-1))
344
+ .sum(dim=1)
345
+ .type(new_x.dtype)
346
+ )
347
+ return final_out
348
+
349
+
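The sort-based dispatch in `moe_infer` groups (token, slot) pairs by expert so that each expert runs once on a contiguous batch. The following plain-Python sketch illustrates the same bookkeeping with scalar "tokens"; the function `moe_dispatch` and the toy experts are hypothetical.

```python
def moe_dispatch(tokens, topk_ids, topk_weight, experts):
    """Hypothetical helper: route each token to its experts and combine weighted outputs."""
    top_k = len(topk_ids[0])
    flat_ids = [e for row in topk_ids for e in row]
    # Stable argsort by expert id; slot // top_k recovers the source token index.
    order = sorted(range(len(flat_ids)), key=lambda slot: flat_ids[slot])
    slot_out = [0.0] * len(flat_ids)
    start = 0
    for expert_id, expert in enumerate(experts):
        count = flat_ids.count(expert_id)  # tokens_per_expert in the tensor code
        for slot in order[start:start + count]:
            slot_out[slot] = expert(tokens[slot // top_k])
        start += count
    # Scatter back to the original order and take the weighted sum over the top_k slots.
    return [
        sum(slot_out[t * top_k + s] * topk_weight[t][s] for s in range(top_k))
        for t in range(len(tokens))
    ]


experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
result = moe_dispatch(
    tokens=[1.0, 2.0],
    topk_ids=[[0, 2], [1, 0]],
    topk_weight=[[0.5, 0.5], [0.25, 0.75]],
    experts=experts,
)
print(result)
```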
350
+ # Copied from transformers.models.llama.modeling_llama.repeat_kv
351
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
352
+ """
353
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
354
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
355
+ """
356
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
357
+ if n_rep == 1:
358
+ return hidden_states
359
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
360
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
361
+
362
+
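The head ordering produced by `repeat_kv` matters for grouped-query attention: each key/value head is repeated consecutively, so query head `h` reads key/value head `h // n_rep`. A tiny plain-Python sketch with placeholder strings in place of tensors (the name `repeat_kv_heads` is hypothetical):

```python
def repeat_kv_heads(kv_heads, n_rep):
    """Hypothetical list analogue of repeat_kv along the head dimension."""
    return [head for head in kv_heads for _ in range(n_rep)]


expanded = repeat_kv_heads(["kv0", "kv1"], n_rep=3)
print(expanded)
```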
363
+ # Copied from transformers.models.llama.modeling_llama.LlamaAttention with Llama->LLaDA2Moe
364
+ class LLaDA2MoeAttention(nn.Module):
365
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
366
+
367
+ def __init__(self, config: LLaDA2MoeConfig, layer_idx: Optional[int] = None):
368
+ super().__init__()
369
+ self.config = config
370
+ self.layer_idx = layer_idx
371
+ if layer_idx is None:
372
+ logger.warning_once(
373
+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
374
+ "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
375
+ "when creating this class."
376
+ )
377
+
378
+ self.attention_dropout = config.attention_dropout
379
+ self.hidden_size = config.hidden_size
380
+ self.num_heads = config.num_attention_heads
381
+ self.head_dim = config.head_dim or self.hidden_size // self.num_heads
382
+ partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
383
+ self.rope_dim = int(self.head_dim * partial_rotary_factor)
384
+ self.num_key_value_heads = config.num_key_value_heads
385
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
386
+ self.max_position_embeddings = config.max_position_embeddings
387
+ self.rope_theta = config.rope_theta
388
+ self.is_causal = False
389
+
390
+ self.query_key_value = nn.Linear(
391
+ self.hidden_size,
392
+ (self.num_heads + 2 * self.num_key_value_heads) * self.head_dim,
393
+ bias=config.use_qkv_bias,
394
+ )
395
+
396
+ self.query_layernorm = LLaDA2MoeRMSNorm(self.head_dim, eps=config.rms_norm_eps)
397
+ self.key_layernorm = LLaDA2MoeRMSNorm(self.head_dim, eps=config.rms_norm_eps)
398
+ self.dense = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.use_bias)
399
+
400
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
401
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
402
+
403
+ def forward(
404
+ self,
405
+ hidden_states: torch.Tensor,
406
+ attention_mask: Optional[torch.Tensor] = None,
407
+ position_ids: Optional[torch.LongTensor] = None,
408
+ past_key_value: Optional[Cache] = None,
409
+ output_attentions: bool = False,
410
+ use_cache: bool = False,
411
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC
412
+ **kwargs,
413
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
414
+ if "padding_mask" in kwargs:
415
+ warnings.warn(
416
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
417
+ )
418
+
419
+ bsz, q_len, _ = hidden_states.size()
420
+
421
+ qkv = self.query_key_value(hidden_states)
422
+ qkv = qkv.view(bsz, q_len, self.num_heads + 2 * self.num_key_value_heads, self.head_dim)
423
+
424
+ query_states, key_states, value_states = qkv.split(
425
+ [self.num_heads, self.num_key_value_heads, self.num_key_value_heads], dim=-2
426
+ )
427
+ query_states = query_states.transpose(1, 2)
428
+ key_states = key_states.transpose(1, 2)
429
+ value_states = value_states.transpose(1, 2)
430
+
431
+ query_states = self.query_layernorm(query_states)
432
+ key_states = self.key_layernorm(key_states)
433
+
434
+ kv_seq_len = key_states.shape[-2]
435
+ if past_key_value is not None:
436
+ if self.layer_idx is None:
437
+ raise ValueError(
438
+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
439
+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
440
+ "with a layer index."
441
+ )
442
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
443
+ cos, sin = position_embeddings
444
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
445
+
446
+ if past_key_value is not None:
447
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
448
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
449
+
450
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
451
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
452
+
453
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
454
+
455
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
456
+ raise ValueError(
457
+ f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
458
+ f" {attn_weights.size()}"
459
+ )
460
+ # attention_mask = None
461
+ if attention_mask is not None:
462
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
463
+ raise ValueError(
464
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
465
+ )
466
+ attn_weights = attn_weights + attention_mask
467
+
468
+ # upcast attention to fp32
469
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
470
+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
471
+ attn_output = torch.matmul(attn_weights, value_states)
472
+
473
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
474
+ raise ValueError(
475
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
476
+ f" {attn_output.size()}"
477
+ )
478
+
479
+ attn_output = attn_output.transpose(1, 2).contiguous()
480
+
481
+ attn_output = attn_output.reshape(bsz, q_len, -1)
482
+
483
+ attn_output = self.dense(attn_output)
484
+
485
+ if not output_attentions:
486
+ attn_weights = None
487
+
488
+ return attn_output, attn_weights, past_key_value
489
+
490
+
491
+ # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2 with Llama->LLaDA2Moe
492
+ class LLaDA2MoeFlashAttention2(LLaDA2MoeAttention):
493
+ """
494
+ LLaDA2Moe flash attention module. This module inherits from `LLaDA2MoeAttention` as the weights of the module stay
495
+ untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
496
+ flash attention and deal with padding tokens in case the input contains any of them.
497
+ """
498
+
499
+ def __init__(self, *args, **kwargs):
500
+ super().__init__(*args, **kwargs)
501
+
502
+ # TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
502
+ # flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment, which was made the default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
504
+ # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
505
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
506
+
507
+ def forward(
508
+ self,
509
+ hidden_states: torch.Tensor,
510
+ attention_mask: Optional[torch.LongTensor] = None,
511
+ position_ids: Optional[torch.LongTensor] = None,
512
+ past_key_value: Optional[Cache] = None,
513
+ output_attentions: bool = False,
514
+ use_cache: bool = False,
515
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC
516
+ **kwargs,
517
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
518
+ # LLaDA2MoeFlashAttention2 attention does not support output_attentions
519
+ if "padding_mask" in kwargs:
520
+ warnings.warn(
521
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
522
+ )
523
+
524
+ # overwrite attention_mask with padding_mask
525
+ attention_mask = kwargs.pop("padding_mask")
526
+
527
+ output_attentions = False
528
+
529
+ bsz, q_len, _ = hidden_states.size()
530
+
531
+ # Flash attention requires the input to have the shape
532
+ # batch_size x seq_length x num_heads x head_dim
533
+ # therefore we just need to keep the original shape
534
+
535
+ qkv = self.query_key_value(hidden_states)
536
+ qkv = qkv.view(bsz, q_len, self.num_heads + 2 * self.num_key_value_heads, self.head_dim)
537
+
538
+ query_states, key_states, value_states = qkv.split(
539
+ [self.num_heads, self.num_key_value_heads, self.num_key_value_heads], dim=-2
540
+ )
541
+ query_states = query_states.transpose(1, 2)
542
+ key_states = key_states.transpose(1, 2)
543
+ value_states = value_states.transpose(1, 2)
544
+
545
+ query_states = self.query_layernorm(query_states)
546
+ key_states = self.key_layernorm(key_states)
547
+
548
+ kv_seq_len = key_states.shape[-2]
549
+ if past_key_value is not None:
550
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
551
+ cos, sin = position_embeddings
552
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
553
+
554
+ if past_key_value is not None:
555
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
556
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
557
+
558
+ # TODO: These transposes are quite inefficient, but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
559
+ # to be able to avoid many of these transpose/reshape/view.
560
+ query_states = query_states.transpose(1, 2)
561
+ key_states = key_states.transpose(1, 2)
562
+ value_states = value_states.transpose(1, 2)
563
+
564
+ dropout_rate = self.attention_dropout if self.training else 0.0
565
+
566
+ # In PEFT, layer norms are usually cast to float32 for training stability,
566
+ # so the input hidden states can be silently upcast to float32. We therefore
567
+ # cast them back to the expected dtype so that everything works as intended.
568
+ # This may slow down training and inference, so casting the LayerNorms to fp32
569
+ # is not recommended. (LLaDA2MoeRMSNorm handles this correctly.)
571
+
572
+ input_dtype = query_states.dtype
573
+ if input_dtype == torch.float32:
574
+ # Handle the case where the model is quantized
575
+ if hasattr(self.config, "_pre_quantization_dtype"):
576
+ target_dtype = self.config._pre_quantization_dtype
577
+ elif torch.is_autocast_enabled():
578
+ target_dtype = torch.get_autocast_gpu_dtype()
579
+ else:
580
+ target_dtype = self.query_key_value.weight.dtype
581
+
582
+ logger.warning_once(
583
+ f"The input hidden states seem to have been silently cast to float32; this might be related to"
583
+ f" the fact that you have upcast embedding or layer norm layers to float32. We will cast the input back to"
585
+ f" {target_dtype}."
586
+ )
587
+
588
+ query_states = query_states.to(target_dtype)
589
+ key_states = key_states.to(target_dtype)
590
+ value_states = value_states.to(target_dtype)
591
+
592
+ attn_output = self._flash_attention_forward(
593
+ query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate
594
+ )
595
+
596
+ attn_output = attn_output.reshape(bsz, q_len, -1).contiguous()
597
+ attn_output = self.dense(attn_output)
598
+
599
+ if not output_attentions:
600
+ attn_weights = None
601
+
602
+ return attn_output, attn_weights, past_key_value
603
+
604
+ def _flash_attention_forward(
605
+ self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
606
+ ):
607
+ """
608
+ Calls the forward method of Flash Attention. If the input hidden states contain at least one padding token,
608
+ first unpads the input, then computes the attention output and pads the result back to the original shape.
610
+
611
+ Args:
612
+ query_states (`torch.Tensor`):
613
+ Input query states to be passed to Flash Attention API
614
+ key_states (`torch.Tensor`):
615
+ Input key states to be passed to Flash Attention API
616
+ value_states (`torch.Tensor`):
617
+ Input value states to be passed to Flash Attention API
618
+ attention_mask (`torch.Tensor`):
619
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
620
+ position of padding tokens and 1 for the position of non-padding tokens.
621
+ dropout (`float`, *optional*):
622
+ Attention dropout
623
+ softmax_scale (`float`, *optional*):
624
+ The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
625
+ query_length (`int`):
626
+ The length of the query sequence in terms of tokens. This represents the number of tokens in the
627
+ `query_states` tensor along the sequence dimension. It is used to determine the effective sequence
628
+ length for attention computations.
629
+ """
630
+ if not self._flash_attn_uses_top_left_mask:
631
+ causal = self.is_causal
632
+ else:
633
+ # TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1. For details, please see the comment in LLaDA2MoeFlashAttention2.__init__.
634
+ causal = self.is_causal and query_length != 1
635
+
636
+ # attention_mask = None
637
+ # Contains at least one padding token in the sequence
638
+ if attention_mask is not None:
639
+ batch_size = query_states.shape[0]
640
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
641
+ query_states, key_states, value_states, attention_mask, query_length
642
+ )
643
+
644
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
645
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
646
+
647
+ attn_output_unpad = flash_attn_varlen_func(
648
+ query_states,
649
+ key_states,
650
+ value_states,
651
+ cu_seqlens_q=cu_seqlens_q,
652
+ cu_seqlens_k=cu_seqlens_k,
653
+ max_seqlen_q=max_seqlen_in_batch_q,
654
+ max_seqlen_k=max_seqlen_in_batch_k,
655
+ dropout_p=dropout,
656
+ softmax_scale=softmax_scale,
657
+ causal=causal,
658
+ )
659
+
660
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
661
+ else:
662
+ attn_output = flash_attn_func(
663
+ query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal
664
+ )
665
+
666
+ return attn_output
667
+
668
+ def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
669
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
670
+ batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
671
+
672
+ key_layer = index_first_axis(
673
+ key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
674
+ )
675
+ value_layer = index_first_axis(
676
+ value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
677
+ )
678
+ if query_length == kv_seq_len:
679
+ query_layer = index_first_axis(
680
+ query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
681
+ )
682
+ cu_seqlens_q = cu_seqlens_k
683
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
684
+ indices_q = indices_k
685
+ elif query_length == 1:
686
+ max_seqlen_in_batch_q = 1
687
+ cu_seqlens_q = torch.arange(
688
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
689
+ ) # There is a memcpy here, that is very bad.
690
+ indices_q = cu_seqlens_q[:-1]
691
+ query_layer = query_layer.squeeze(1)
692
+ else:
693
+ # The -q_len: slice assumes left padding.
694
+ attention_mask = attention_mask[:, -query_length:]
695
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
696
+
697
+ return (
698
+ query_layer,
699
+ key_layer,
700
+ value_layer,
701
+ indices_q,
702
+ (cu_seqlens_q, cu_seqlens_k),
703
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
704
+ )
705
+
706
+
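The unpadding step in `_upad_input` relies on three quantities derived from the padding mask: flat indices of real tokens, cumulative sequence lengths, and the longest sequence in the batch. `_get_unpad_data` itself is not shown in this section, so the following plain-Python sketch is an assumption about that bookkeeping based on the mask convention in the docstring above (1 for real tokens, 0 for padding); the name `unpad_metadata` is hypothetical.

```python
def unpad_metadata(attention_mask):
    """Hypothetical helper: derive varlen metadata from a (batch, seq_len) 0/1 mask."""
    seq_len = len(attention_mask[0])
    seqlens = [sum(row) for row in attention_mask]
    # Positions of real tokens in the flattened (batch * seq_len) layout.
    indices = [
        b * seq_len + t
        for b, row in enumerate(attention_mask)
        for t, keep in enumerate(row)
        if keep
    ]
    # Cumulative lengths with a leading 0, marking sequence boundaries in the packed layout.
    cu_seqlens = [0]
    for n in seqlens:
        cu_seqlens.append(cu_seqlens[-1] + n)
    return indices, cu_seqlens, max(seqlens)


indices, cu_seqlens, max_seqlen = unpad_metadata([[0, 1, 1], [1, 1, 1]])
print(indices, cu_seqlens, max_seqlen)
```

Sequence `b` then occupies rows `cu_seqlens[b]:cu_seqlens[b + 1]` of the packed tensor, which is the layout that variable-length attention kernels consume.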
707
+ # Copied from transformers.models.llama.modeling_llama.LlamaSdpaAttention with Llama->LLaDA2Moe
708
+ class LLaDA2MoeSdpaAttention(LLaDA2MoeAttention):
709
+ """
710
+ LLaDA2Moe attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
711
+ `LLaDA2MoeAttention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to
712
+ SDPA API.
713
+ """
714
+
715
+ # Adapted from LLaDA2MoeAttention.forward
716
+ def forward(
717
+ self,
718
+ hidden_states: torch.Tensor,
719
+ attention_mask: Optional[torch.Tensor] = None,
720
+ position_ids: Optional[torch.LongTensor] = None,
721
+ past_key_value: Optional[Cache] = None,
722
+ output_attentions: bool = False,
723
+ use_cache: bool = False,
724
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC
725
+ **kwargs,
726
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
727
+ if output_attentions:
728
+ # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
729
+ logger.warning_once(
730
+ "LLaDA2MoeModel is using LLaDA2MoeSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
731
+ 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
732
+ )
733
+ return super().forward(
734
+ hidden_states=hidden_states,
735
+ attention_mask=attention_mask,
736
+ position_ids=position_ids,
737
+ past_key_value=past_key_value,
738
+ output_attentions=output_attentions,
739
+ use_cache=use_cache, position_embeddings=position_embeddings,
740
+ )
741
+
742
+ bsz, q_len, _ = hidden_states.size()
743
+
744
+ qkv = self.query_key_value(hidden_states)
745
+ qkv = qkv.view(bsz, q_len, self.num_heads + 2 * self.num_key_value_heads, self.head_dim)
746
+
747
+ query_states, key_states, value_states = qkv.split(
748
+ [self.num_heads, self.num_key_value_heads, self.num_key_value_heads], dim=-2
749
+ )
750
+ query_states = query_states.transpose(1, 2)
751
+ key_states = key_states.transpose(1, 2)
752
+ value_states = value_states.transpose(1, 2)
753
+
754
+ query_states = self.query_layernorm(query_states)
755
+ key_states = self.key_layernorm(key_states)
756
+
757
+ kv_seq_len = key_states.shape[-2]
758
+ if past_key_value is not None:
759
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
760
+ cos, sin = position_embeddings
761
+
762
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
763
+
764
+ if past_key_value is not None:
765
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
766
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
767
+
768
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
769
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
770
+
771
+ # attention_mask = None
772
+ if attention_mask is not None:
773
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
774
+ raise ValueError(
775
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
776
+ )
777
+
778
+ # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
779
+ # Reference: https://github.com/pytorch/pytorch/issues/112577.
780
+ if query_states.device.type == "cuda" and attention_mask is not None:
781
+ query_states = query_states.contiguous()
782
+ key_states = key_states.contiguous()
783
+ value_states = value_states.contiguous()
784
+
785
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
786
+ query_states,
787
+ key_states,
788
+ value_states,
789
+ attn_mask=attention_mask,
790
+ dropout_p=self.attention_dropout if self.training else 0.0,
791
+ # The q_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case q_len == 1.
792
+ is_causal=self.is_causal and attention_mask is None and q_len > 1,
793
+ )
794
+
795
+ attn_output = attn_output.transpose(1, 2).contiguous()
796
+ attn_output = attn_output.reshape(bsz, q_len, -1)
797
+
798
+ attn_output = self.dense(attn_output)
799
+
800
+ return attn_output, None, past_key_value
801
+
802
+
803
+ ATTENTION_CLASSES = {
804
+ "eager": LLaDA2MoeAttention,
805
+ "flash_attention_2": LLaDA2MoeFlashAttention2,
806
+ "sdpa": LLaDA2MoeSdpaAttention,
807
+ }
808
+
809
+
810
+ class LLaDA2MoeDecoderLayer(nn.Module):
811
+ def __init__(self, config: LLaDA2MoeConfig, layer_idx: int):
812
+ super().__init__()
813
+ self.hidden_size = config.hidden_size
814
+
815
+ self.attention = ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)
816
+
817
+ self.mlp = (
818
+ LLaDA2MoeSparseMoeBlock(config)
819
+ if (config.num_experts is not None and layer_idx >= config.first_k_dense_replace)
820
+ else LLaDA2MoeMLP(config=config, intermediate_size=config.intermediate_size)
821
+ )
822
+ self.input_layernorm = LLaDA2MoeRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
823
+ self.post_attention_layernorm = LLaDA2MoeRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
824
+
825
+ def forward(
826
+ self,
827
+ hidden_states: torch.Tensor,
828
+ attention_mask: Optional[torch.Tensor] = None,
829
+ position_ids: Optional[torch.LongTensor] = None,
830
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
831
+ output_attentions: Optional[bool] = False,
832
+ output_router_logits: Optional[bool] = False,
833
+ use_cache: Optional[bool] = False,
834
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC
835
+ **kwargs,
836
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
837
+ """
838
+ Args:
839
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
840
+ attention_mask (`torch.FloatTensor`, *optional*):
841
+ attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
842
+ query_sequence_length, key_sequence_length)` if default attention is used.
843
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
844
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
845
+ config.n_positions - 1]`.
846
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*):
847
+ cached past key and value projection states
848
+ output_attentions (`bool`, *optional*):
849
+ Whether to return the attentions tensors of all attention layers. See `attentions` under
850
+ returned tensors for more detail.
851
+ output_router_logits (`bool`, *optional*):
852
+ Whether or not to return the logits of all the routers. They are useful for computing the router loss,
853
+ and should not be returned during inference.
854
+ use_cache (`bool`, *optional*):
855
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
856
+ (see `past_key_values`).
857
+ """
858
+ if "padding_mask" in kwargs:
859
+ warnings.warn(
860
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
861
+ )
862
+ residual = hidden_states
863
+
864
+ hidden_states = self.input_layernorm(hidden_states)
865
+
866
+ # Self Attention
867
+ hidden_states, self_attn_weights, present_key_value = self.attention(
868
+ hidden_states=hidden_states,
869
+ attention_mask=attention_mask,
870
+ position_ids=position_ids,
871
+ past_key_value=past_key_value,
872
+ output_attentions=output_attentions,
873
+ position_embeddings=position_embeddings,
874
+ use_cache=use_cache,
875
+ )
876
+ hidden_states = residual + hidden_states
877
+
878
+ # Fully Connected
879
+ residual = hidden_states
880
+ hidden_states = self.post_attention_layernorm(hidden_states)
881
+ hidden_states = self.mlp(hidden_states)
882
+ if isinstance(hidden_states, tuple):
883
+ hidden_states, router_logits = hidden_states
884
+ else:
885
+ router_logits = None
886
+ hidden_states = residual + hidden_states.to(residual.device)
887
+
888
+ outputs = (hidden_states,)
889
+
890
+ if output_attentions:
891
+ outputs += (self_attn_weights,)
892
+
893
+ if use_cache:
894
+ outputs += (present_key_value,)
895
+
896
+ if output_router_logits:
897
+ outputs += (router_logits,)
898
+
899
+ return outputs
900
+
901
+
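Note on the decoder layer above: the `mlp` selection in `LLaDA2MoeDecoderLayer.__init__` keeps the first `first_k_dense_replace` layers dense and uses the sparse MoE block for the rest, provided `num_experts` is set. The sketch below only mirrors that condition in plain Python. The helper name `layer_kinds` and the numbers passed to it are illustrative assumptions, not values from this commit's `config.json`.

```python
from typing import List, Optional


def layer_kinds(
    num_hidden_layers: int,
    num_experts: Optional[int],
    first_k_dense_replace: int,
) -> List[str]:
    """Return "dense" or "moe" per layer, using the same test as the diff."""
    kinds = []
    for layer_idx in range(num_hidden_layers):
        # Same condition as in LLaDA2MoeDecoderLayer.__init__.
        use_moe = num_experts is not None and layer_idx >= first_k_dense_replace
        kinds.append("moe" if use_moe else "dense")
    return kinds


# Hypothetical values, chosen only to show the split.
print(layer_kinds(num_hidden_layers=4, num_experts=8, first_k_dense_replace=1))
# ['dense', 'moe', 'moe', 'moe']
```

With `num_experts=None` every layer comes out dense, whatever `first_k_dense_replace` is.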
902
+ LLADA2MOE_START_DOCSTRING = r"""
903
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
904
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
905
+ etc.)
906
+
907
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
908
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
909
+ and behavior.
910
+
911
+ Parameters:
912
+ config ([`LLaDA2MoeConfig`]):
913
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
914
+ load the weights associated with the model, only the configuration. Check out the
915
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
916
+ """
917
+
918
+
919
+ @add_start_docstrings(
920
+ "The bare LLaDA2Moe Model outputting raw hidden-states without any specific head on top.",
921
+ LLADA2MOE_START_DOCSTRING,
922
+ )
923
+ class LLaDA2MoePreTrainedModel(PreTrainedModel):
924
+ config_class = LLaDA2MoeConfig
925
+ base_model_prefix = "model"
926
+ supports_gradient_checkpointing = True
927
+ _no_split_modules = ["LLaDA2MoeDecoderLayer"]
928
+ _skip_keys_device_placement = "past_key_values"
929
+ _supports_flash_attn_2 = True
930
+ _supports_sdpa = True
931
+ _supports_cache_class = True
932
+
933
+ def _init_weights(self, module):
934
+ std = self.config.initializer_range
935
+ if isinstance(module, nn.Linear):
936
+ module.weight.data.normal_(mean=0.0, std=std)
937
+ if module.bias is not None:
938
+ module.bias.data.zero_()
939
+ elif isinstance(module, nn.Embedding):
940
+ module.weight.data.normal_(mean=0.0, std=std)
941
+ if module.padding_idx is not None:
942
+ module.weight.data[module.padding_idx].zero_()
943
+
944
+
945
+ LLADA2MOE_INPUTS_DOCSTRING = r"""
946
+ Args:
947
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
948
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
949
+ it.
950
+
951
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
952
+ [`PreTrainedTokenizer.__call__`] for details.
953
+
954
+ [What are input IDs?](../glossary#input-ids)
955
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
956
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
957
+
958
+ - 1 for tokens that are **not masked**,
959
+ - 0 for tokens that are **masked**.
960
+
961
+ [What are attention masks?](../glossary#attention-mask)
962
+
963
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
964
+ [`PreTrainedTokenizer.__call__`] for details.
965
+
966
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
967
+ `past_key_values`).
968
+
969
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
970
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
971
+ information on the default strategy.
972
+
973
+ - 1 indicates the head is **not masked**,
974
+ - 0 indicates the head is **masked**.
975
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
976
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
977
+ config.n_positions - 1]`.
978
+
979
+ [What are position IDs?](../glossary#position-ids)
980
+ past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
981
+ Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
982
+ blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
983
+ returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
984
+
985
+ Two formats are allowed:
986
+ - a [`~cache_utils.Cache`] instance;
987
+ - Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
988
+ shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`). This is also known as the legacy
989
+ cache format.
990
+
991
+ The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
992
+ legacy cache format will be returned.
993
+
994
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
995
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
996
+ of shape `(batch_size, sequence_length)`.
997
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
998
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
999
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
1000
+ model's internal embedding lookup matrix.
1001
+ use_cache (`bool`, *optional*):
1002
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
1003
+ `past_key_values`).
1004
+ output_attentions (`bool`, *optional*):
1005
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
1006
+ tensors for more detail.
1007
+ output_hidden_states (`bool`, *optional*):
1008
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
1009
+ more detail.
1010
+ return_dict (`bool`, *optional*):
1011
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
1012
+ """
1013
+
1014
+
1015
+ @add_start_docstrings(
1016
+ "The bare LLaDA2Moe Model outputting raw hidden-states without any specific head on top.",
1017
+ LLADA2MOE_START_DOCSTRING,
1018
+ )
1019
+ class LLaDA2MoeModel(LLaDA2MoePreTrainedModel):
1020
+ """
1021
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`LLaDA2MoeDecoderLayer`]
1022
+
1023
+ Args:
1024
+ config: LLaDA2MoeConfig
1025
+ """
1026
+
1027
+ def __init__(self, config: LLaDA2MoeConfig):
1028
+ super().__init__(config)
1029
+ self.padding_idx = config.pad_token_id
1030
+ self.vocab_size = config.vocab_size
1031
+
1032
+ self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
1033
+ self.layers = nn.ModuleList(
1034
+ [LLaDA2MoeDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
1035
+ )
1036
+ self._use_sdpa = config._attn_implementation == "sdpa"
1037
+ self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2"
1038
+ self.norm = LLaDA2MoeRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
1039
+ self.rotary_emb = LLaDA2MoeRotaryEmbedding(config=config)
1040
+ self.gradient_checkpointing = False
1041
+ # Initialize weights and apply final processing
1042
+ self.post_init()
1043
+
1044
+ def get_input_embeddings(self):
1045
+ return self.word_embeddings
1046
+
1047
+ def set_input_embeddings(self, value):
1048
+ self.word_embeddings = value
1049
+
1050
+ @add_start_docstrings_to_model_forward(LLADA2MOE_INPUTS_DOCSTRING)
1051
+ def forward(
1052
+ self,
1053
+ input_ids: torch.LongTensor = None,
1054
+ attention_mask: Optional[torch.Tensor] = None,
1055
+ position_ids: Optional[torch.LongTensor] = None,
1056
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1057
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1058
+ use_cache: Optional[bool] = None,
1059
+ output_attentions: Optional[bool] = None,
1060
+ output_hidden_states: Optional[bool] = None,
1061
+ output_router_logits: Optional[bool] = None,
1062
+ return_dict: Optional[bool] = None,
1063
+ **kwargs,
1064
+ ) -> Union[Tuple, MoeModelOutputWithPast]:
1065
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1066
+ output_hidden_states = (
1067
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1068
+ )
1069
+ output_router_logits = (
1070
+ output_router_logits if output_router_logits is not None else self.config.output_router_logits
1071
+ )
1072
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
1073
+
1074
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1075
+
1076
+ # retrieve input_ids and inputs_embeds
1077
+ if input_ids is not None and inputs_embeds is not None:
1078
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
1079
+ elif input_ids is not None:
1080
+ batch_size, seq_length = input_ids.shape[:2]
1081
+ elif inputs_embeds is not None:
1082
+ batch_size, seq_length = inputs_embeds.shape[:2]
1083
+ else:
1084
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
1085
+
1086
+ if self.gradient_checkpointing and self.training:
1087
+ if use_cache:
1088
+ logger.warning_once(
1089
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
1090
+ )
1091
+ use_cache = False
1092
+
1093
+ past_key_values_length = 0
1094
+ if use_cache:
1095
+ use_legacy_cache = not isinstance(past_key_values, Cache)
1096
+ if use_legacy_cache:
1097
+ past_key_values = DynamicCache.from_legacy_cache(past_key_values)
1098
+ past_key_values_length = past_key_values.get_usable_length(seq_length)
1099
+
1100
+ if position_ids is None:
1101
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
1102
+ position_ids = torch.arange(
1103
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
1104
+ )
1105
+ position_ids = position_ids.unsqueeze(0)
1106
+
1107
+ if inputs_embeds is None:
1108
+ inputs_embeds = self.word_embeddings(input_ids)
1109
+
1110
+ # TODO flash attention 2 can not support custom attention mask
1111
+ # if self._use_flash_attention_2:
1112
+ # # 2d mask is passed through the layers
1113
+ # attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
1114
+ if self._use_sdpa and not output_attentions:
1115
+ # output_attentions=True can not be supported when using SDPA, and we fall back on
1116
+ # the manual implementation that requires a 4D causal mask in all cases.
1117
+ attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
1118
+ attention_mask,
1119
+ (batch_size, seq_length),
1120
+ inputs_embeds,
1121
+ past_key_values_length,
1122
+ )
1123
+ else:
1124
+ # 4d mask is passed through the layers
1125
+ attention_mask = _prepare_4d_causal_attention_mask(
1126
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
1127
+ )
1128
+
1129
+ # embed positions
1130
+ hidden_states = inputs_embeds
1131
+
1132
+ # create position embeddings to be shared across the decoder layers
1133
+ position_embeddings = self.rotary_emb(hidden_states, position_ids)
1134
+
1135
+ # decoder layers
1136
+ all_hidden_states = () if output_hidden_states else None
1137
+ all_self_attns = () if output_attentions else None
1138
+ all_router_logits = () if output_router_logits else None
1139
+ next_decoder_cache = None
1140
+
1141
+ for decoder_layer in self.layers:
1142
+ if output_hidden_states:
1143
+ all_hidden_states += (hidden_states,)
1144
+
1145
+ if self.gradient_checkpointing and self.training:
1146
+ layer_outputs = self._gradient_checkpointing_func(
1147
+ decoder_layer.__call__,
1148
+ hidden_states,
1149
+ attention_mask,
1150
+ position_ids,
1151
+ past_key_values,
1152
+ output_attentions,
1153
+ output_router_logits,
1154
+ use_cache,
1155
+ position_embeddings,
1156
+ )
1157
+ else:
1158
+ layer_outputs = decoder_layer(
1159
+ hidden_states,
1160
+ attention_mask=attention_mask,
1161
+ position_ids=position_ids,
1162
+ past_key_value=past_key_values,
1163
+ output_attentions=output_attentions,
1164
+ output_router_logits=output_router_logits,
1165
+ use_cache=use_cache,
1166
+ position_embeddings=position_embeddings,
1167
+ )
1168
+ hidden_states = layer_outputs[0]
1169
+
1170
+ if use_cache:
1171
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
1172
+
1173
+ if output_attentions:
1174
+ all_self_attns += (layer_outputs[1],)
1175
+
1176
+ if output_router_logits and layer_outputs[-1] is not None:
1177
+ all_router_logits += (layer_outputs[-1],)
1178
+
1179
+ hidden_states = self.norm(hidden_states)
1180
+
1181
+ # add hidden states from the last decoder layer
1182
+ if output_hidden_states:
1183
+ all_hidden_states += (hidden_states,)
1184
+
1185
+ next_cache = None
1186
+ if use_cache:
1187
+ next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
1188
+ if not return_dict:
1189
+ return tuple(
1190
+ v
1191
+ for v in [hidden_states, next_cache, all_hidden_states, all_self_attns, all_router_logits]
1192
+ if v is not None
1193
+ )
1194
+ return MoeModelOutputWithPast(
1195
+ last_hidden_state=hidden_states,
1196
+ past_key_values=next_cache,
1197
+ hidden_states=all_hidden_states,
1198
+ attentions=all_self_attns,
1199
+ router_logits=all_router_logits,
1200
+ )
1201
+
1202
+
1203
+ class LLaDA2MoeModelLM(LLaDA2MoePreTrainedModel, GenerationMixin):
1204
+ _tied_weights_keys = ["lm_head.weight"]
1205
+
1206
+ def __init__(self, config: LLaDA2MoeConfig):
1207
+ super().__init__(config)
1208
+ self.model = LLaDA2MoeModel(config)
1209
+ self.vocab_size = config.vocab_size
1210
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1211
+
1212
+ # Initialize weights and apply final processing
1213
+ self.post_init()
1214
+
1215
+ def get_input_embeddings(self):
1216
+ return self.model.word_embeddings
1217
+
1218
+ def set_input_embeddings(self, value):
1219
+ self.model.word_embeddings = value
1220
+
1221
+ def get_output_embeddings(self):
1222
+ return self.lm_head
1223
+
1224
+ def set_output_embeddings(self, new_embeddings):
1225
+ self.lm_head = new_embeddings
1226
+
1227
+ def set_decoder(self, decoder):
1228
+ self.model = decoder
1229
+
1230
+ def get_decoder(self):
1231
+ return self.model
1232
+
1233
+ @add_start_docstrings_to_model_forward(LLADA2MOE_INPUTS_DOCSTRING)
1234
+ @replace_return_docstrings(output_type=MoeCausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1235
+ def forward(
1236
+ self,
1237
+ input_ids: torch.LongTensor = None,
1238
+ attention_mask: Optional[torch.Tensor] = None,
1239
+ position_ids: Optional[torch.LongTensor] = None,
1240
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1241
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1242
+ labels: Optional[torch.LongTensor] = None,
1243
+ use_cache: Optional[bool] = None,
1244
+ output_attentions: Optional[bool] = None,
1245
+ output_hidden_states: Optional[bool] = None,
1246
+ output_router_logits: Optional[bool] = None,
1247
+ return_dict: Optional[bool] = None,
1248
+ **kwargs,
1249
+ ) -> Union[Tuple, MoeCausalLMOutputWithPast]:
1250
+ r"""
1251
+ Args:
1252
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1253
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
1254
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1255
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
1256
+
1257
+ Returns:
1258
+
1259
+ Example:
1260
+
1261
+ ```python
1262
+ >>> from transformers import AutoTokenizer
1263
+
1264
+ >>> model = LLaDA2MoeModelLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
1265
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
1266
+
1267
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
1268
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1269
+
1270
+ >>> # Generate
1271
+ >>> generate_ids = model.generate(inputs.input_ids, gen_length=30)
1272
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1273
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
1274
+ ```"""
1275
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1276
+ output_hidden_states = (
1277
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1278
+ )
1279
+ output_router_logits = (
1280
+ output_router_logits if output_router_logits is not None else self.config.output_router_logits
1281
+ )
1282
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1283
+ # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
1284
+ outputs = self.model(
1285
+ input_ids=input_ids,
1286
+ attention_mask=attention_mask,
1287
+ position_ids=position_ids,
1288
+ past_key_values=past_key_values,
1289
+ inputs_embeds=inputs_embeds,
1290
+ use_cache=use_cache,
1291
+ output_attentions=output_attentions,
1292
+ output_hidden_states=output_hidden_states,
1293
+ output_router_logits=output_router_logits,
1294
+ return_dict=return_dict,
1295
+ **kwargs,
1296
+ )
1297
+
1298
+ hidden_states = outputs[0]
1299
+
1300
+ logits = self.lm_head(hidden_states)
1301
+ logits = logits.float()
1302
+
1303
+ loss = None
1304
+ aux_loss = None
1305
+
1306
+ if labels is not None:
1307
+ # Shift so that tokens < n predict n
1308
+ shift_logits = logits[..., :-1, :].contiguous()
1309
+ shift_labels = labels[..., 1:].contiguous()
1310
+ # Flatten the tokens
1311
+ loss_fct = CrossEntropyLoss()
1312
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1313
+ shift_labels = shift_labels.view(-1)
1314
+ # Enable model parallelism
1315
+ shift_labels = shift_labels.to(shift_logits.device)
1316
+ loss = loss_fct(shift_logits, shift_labels)
1317
+
1318
+ if not return_dict:
1319
+ output = (logits,) + outputs[1:]
1320
+ if output_router_logits:
1321
+ output = (aux_loss,) + output
1322
+ return (loss,) + output if loss is not None else output
1323
+
1324
+ return MoeCausalLMOutputWithPast(
1325
+ loss=loss,
1326
+ aux_loss=aux_loss,
1327
+ logits=logits,
1328
+ past_key_values=outputs.past_key_values,
1329
+ hidden_states=outputs.hidden_states,
1330
+ attentions=outputs.attentions,
1331
+ router_logits=outputs.router_logits,
1332
+ )
1333
+
1334
+ def prepare_inputs_for_generation(
1335
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, token_type_ids=None, **kwargs
1336
+ ):
1337
+ if past_key_values is not None:
1338
+ if isinstance(past_key_values, Cache):
1339
+ cache_length = past_key_values.get_seq_length()
1340
+ past_length = past_key_values.seen_tokens
1341
+ max_cache_length = (
1342
+ past_key_values.get_max_length()
1343
+ if hasattr(past_key_values, "get_max_length")
1344
+ else past_key_values.get_max_cache_shape()
1345
+ )
1346
+ else:
1347
+ cache_length = past_length = past_key_values[0][0].shape[2]
1348
+ max_cache_length = None
1349
+
1350
+ # Keep only the unprocessed tokens:
1351
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
1352
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as input)
1353
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
1354
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
1355
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
1356
+ # input_ids based on the past_length.
1357
+ elif past_length < input_ids.shape[1]:
1358
+ input_ids = input_ids[:, past_length:]
1359
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
1360
+
1361
+ # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
1362
+ if (
1363
+ max_cache_length is not None
1364
+ and attention_mask is not None
1365
+ and cache_length + input_ids.shape[1] > max_cache_length
1366
+ ):
1367
+ attention_mask = attention_mask[:, -max_cache_length:]
1368
+
1369
+ position_ids = kwargs.get("position_ids", None)
1370
+ if attention_mask is not None and position_ids is None:
1371
+ # create position_ids on the fly for batch generation
1372
+ position_ids = attention_mask.long().cumsum(-1) - 1
1373
+ position_ids.masked_fill_(attention_mask == 0, 1)
1374
+ if past_key_values:
1375
+ position_ids = position_ids[:, -input_ids.shape[1] :]
1376
+
1377
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1378
+ if inputs_embeds is not None and past_key_values is None:
1379
+ model_inputs = {"inputs_embeds": inputs_embeds}
1380
+ else:
1381
+ model_inputs = {"input_ids": input_ids}
1382
+
1383
+ model_inputs.update(
1384
+ {
1385
+ "position_ids": position_ids,
1386
+ "past_key_values": past_key_values,
1387
+ "use_cache": kwargs.get("use_cache"),
1388
+ "attention_mask": attention_mask,
1389
+ }
1390
+ )
1391
+ return model_inputs
1392
+
1393
+ @staticmethod
1394
+ def _reorder_cache(past_key_values, beam_idx):
1395
+ reordered_past = ()
1396
+ for layer_past in past_key_values:
1397
+ reordered_past += (
1398
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
1399
+ )
1400
+ return reordered_past
1401
+
1402
+ @staticmethod
1403
+ def _top_k_logits(logits, k):
1404
+ if k is None or k <= 0:
1405
+ return logits
1406
+ else:
1407
+ values, _ = torch.topk(logits, k)
1408
+ min_values = values[..., -1, None]
1409
+ return torch.where(
1410
+ logits < min_values, torch.full_like(logits, float("-inf")), logits
1411
+ )
1412
+
1413
+ @staticmethod
1414
+ def _top_p_logits(logits, p):
1415
+ if p is None or p >= 1.0:
1416
+ return logits
1417
+ sorted_logits, sorted_indices = torch.sort(logits, descending=True)
1418
+ cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
1419
+ sorted_mask = cumulative_probs > p
1420
+ sorted_mask[..., 1:] = sorted_mask[..., :-1].clone()
1421
+ sorted_mask[..., 0] = False
1422
+ mask_indices = torch.scatter(
1423
+ torch.full_like(logits, False, dtype=torch.bool),
1424
+ -1,
1425
+ sorted_indices,
1426
+ sorted_mask,
1427
+ )
1428
+ return logits.masked_fill(mask_indices, float("-inf"))
1429
+
1430
+ def _sample_with_temperature_topk_topp(self, logits, temperature=1.0, top_k=0, top_p=1.0):
1431
+ orig_shape = logits.shape[:-1]
1432
+ vocab_size = logits.shape[-1]
1433
+ logits = logits.reshape(-1, vocab_size)
1434
+ if temperature > 0 and temperature != 1.0:
1435
+ logits = logits / temperature
1436
+ logits = self._top_k_logits(logits, top_k)
1437
+ logits = self._top_p_logits(logits, top_p)
1438
+ probs = F.softmax(logits, dim=-1)
1439
+ token = torch.multinomial(probs, num_samples=1)
1440
+ token_prob = torch.gather(probs, -1, token)
1441
+ return token.view(*orig_shape), token_prob.view(*orig_shape)
1442
+
1443
+ @staticmethod
1444
+ def _get_num_transfer_tokens(block_length, steps):
1445
+ if steps == 0:
1446
+ return torch.tensor([], dtype=torch.int64)
1447
+ base = block_length // steps
1448
+ remainder = block_length % steps
1449
+ num_transfer_tokens = torch.full((steps,), base, dtype=torch.int64)
1450
+ num_transfer_tokens[:remainder] += 1
1451
+ return num_transfer_tokens
1452
+
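Note on `_get_num_transfer_tokens` above: it spreads `block_length` tokens over `steps` refinement steps as evenly as possible, giving the first `block_length % steps` steps one extra token. The plain-Python sketch below reproduces that arithmetic without `torch`. The name `transfer_schedule` and the example arguments are illustrative assumptions.

```python
from typing import List


def transfer_schedule(block_length: int, steps: int) -> List[int]:
    """Number of tokens to unmask at each step, mirroring the diff's logic."""
    if steps == 0:
        return []
    base, remainder = divmod(block_length, steps)
    # The first `remainder` steps each take one extra token.
    return [base + 1 if i < remainder else base for i in range(steps)]


print(transfer_schedule(32, 5))
# [7, 7, 6, 6, 6]
```

With the defaults in `generate` (`block_length=32`, `steps=32`) the schedule is one token per step. The entries always sum to `block_length` when `steps > 0`.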
1453
+ @torch.no_grad()
1454
+ def generate(
1455
+ self,
1456
+ inputs: Optional[torch.Tensor] = None,
1457
+ temperature: float = 0.0,
1458
+ block_length: int = 32,
1459
+ steps: int = 32,
1460
+ gen_length: int = 2048,
1461
+ top_p: Optional[float] = None,
1462
+ top_k: Optional[int] = None,
1463
+ eos_early_stop: bool = False,
1464
+ minimal_topk: int = 1,
1465
+ threshold: float = 0.95,
1466
+ eos_id: int = 156892,
1467
+ mask_id: int = 156895,
1468
+ ):
1469
+ r"""
1470
+ Generates tokens using a block-wise, iterative refinement strategy.
1471
+
1472
+ This method operates differently from standard autoregressive generation. It first creates a template of the
1473
+ full desired length, filled with a special `mask_id`. It then processes this template in segments (`blocks`)
1474
+ and iteratively "denoises" or "refines" the `mask_id` tokens into actual tokens over a series of `steps` for
1475
+ each block. A custom block-diagonal causal attention mask ensures that generation within a block can attend to
1476
+ all previous blocks but not future ones.
1477
+
1478
+ <Tip warning={true}>
1479
+
1480
+ This is a specialized generation method. The quality and speed of the output are highly dependent on the interplay
1481
+ between `block_length`, `steps`, and `threshold`. It aims to achieve faster generation through parallel
1482
+ decoding within blocks, which is a departure from the token-by-token generation of standard `.generate()` methods.
1483
+
1484
+ </Tip>
1485
+
1486
+ Parameters:
1487
+ inputs (`torch.Tensor`):
1488
+ The token sequence used as a prompt for the generation.
1489
+ temperature (`float`, *optional*, defaults to 0.0):
1490
+ The value used to modulate the next token probabilities. A value of 0.0 corresponds to greedy decoding.
1491
+ block_length (`int`, *optional*, defaults to 32):
1492
+ The size of each generation block. The model generates text in parallel within these blocks. This is a
1493
+ key parameter for controlling the granularity of the generation process.
1494
+ steps (`int`, *optional*, defaults to 32):
1495
+ The number of iterative refinement (or "denoising") steps to perform for each block. Within each block,
1496
+ the model will try to replace `mask_id` tokens with real tokens for this many iterations.
1497
+ gen_length (`int`, *optional*, defaults to 2048):
1498
+ The maximum number of tokens to generate, excluding the prompt.
1499
+ top_p (`float`, *optional*):
1500
+ If set to a float value between 0 and 1, only the most probable tokens with probabilities that add up to
1501
+ `top_p` or higher are kept for generation (nucleus sampling).
1502
+ top_k (`int`, *optional*):
1503
+ The number of highest probability vocabulary tokens to keep for top-k-filtering.
1504
+ eos_early_stop (`bool`, *optional*, defaults to `False`):
1505
+ If `True`, generation will stop as soon as a valid End-Of-Sequence token is generated and confirmed,
1506
+ even if `gen_length` has not been reached.
1507
+ minimal_topk (`int`, *optional*, defaults to 1):
1508
+ A parameter used to dynamically adjust the number of refinement `steps`. The effective number of steps
1509
+ is capped at `gen_length // minimal_topk`.
1510
+ threshold (`float`, *optional*, defaults to 0.95):
1511
+ The confidence probability threshold for accepting a sampled token. During each refinement step, a
1512
+ sampled token is only kept if its probability is above this threshold. If not enough tokens meet the
1513
+ threshold, the ones with the highest confidence are chosen.
1514
+ eos_id (`int`, *optional*, defaults to 156892):
1515
+ The token ID for the end-of-sequence token. Used for `eos_early_stop`.
1516
+ mask_id (`int`, *optional*, defaults to 156895):
1517
+ The token ID used as a placeholder for tokens that are yet to be generated. This is central to the
1518
+ iterative refinement algorithm.
1519
+
1520
+ Return:
1521
+ `torch.Tensor`: A tensor containing the generated token IDs, starting
1522
+ after the prompt and stopping at the first `eos_id` or `gen_length`.
1523
+ """
1524
+ steps = min(steps, gen_length // minimal_topk)
1525
+ input_ids = inputs.to(self.device)
1526
+
1527
+ prompt_length = input_ids.shape[1]
1528
+ num_blocks = (prompt_length + gen_length + block_length - 1) // block_length
1529
+ total_length = num_blocks * block_length
1530
+
1531
+ block_mask = torch.tril(torch.ones(num_blocks, num_blocks, device=self.device))
1532
+ block_diffusion_attention_mask = (
1533
+ block_mask.repeat_interleave(block_length, dim=0)
1534
+ .repeat_interleave(block_length, dim=1)
1535
+ .unsqueeze(0)
1536
+ .unsqueeze(0)
1537
+ ).bool()
1538
+ block_diffusion_attention_mask = torch.where(
1539
+ block_diffusion_attention_mask, 0.0, float("-inf")
1540
+ ).to(torch.bfloat16)
1541
+
1542
+ position_ids = torch.arange(total_length, device=self.device).unsqueeze(0)
1543
+ x = torch.full((1, total_length), mask_id, dtype=torch.long, device=self.device)
1544
+ x[:, :prompt_length] = input_ids.clone()
1545
+
1546
+ prompt_index_full = torch.zeros_like(x, dtype=torch.bool)
1547
+ prompt_index_full[:, :prompt_length] = True
1548
+
1549
+ prefill_blocks = prompt_length // block_length
1550
+
1551
+ denoising_steps_per_block = steps
1552
+ num_transfer_tokens_schedule = self._get_num_transfer_tokens(
1553
+ block_length, denoising_steps_per_block
1554
+ )
1555
+ for num_block in range(prefill_blocks, num_blocks):
1556
+ current_window_end = (num_block + 1) * block_length
1557
+ cur_x = x[:, :current_window_end]
1558
+ cur_attn_mask = block_diffusion_attention_mask[
1559
+ :, :, :current_window_end, :current_window_end
1560
+ ]
1561
+ cur_position_ids = position_ids[:, :current_window_end]
1562
+
1563
+ for step in range(denoising_steps_per_block):
1564
+ active_block_mask = cur_x[:, -block_length:] == mask_id
1565
+ if active_block_mask.sum() == 0:
1566
+ break
1567
+
1568
+ logits = self.forward(
1569
+ cur_x,
1570
+ attention_mask=cur_attn_mask,
1571
+ position_ids=cur_position_ids,
1572
+ ).logits
1573
+
1574
+ active_logits = logits[:, -block_length:, :]
1575
+ x0, x0_p = self._sample_with_temperature_topk_topp(
1576
+ active_logits, temperature=temperature, top_k=top_k, top_p=top_p
1577
+ )
1578
+
1579
+ num_to_transfer = num_transfer_tokens_schedule[step].item()
1580
+ transfer_index = torch.zeros_like(x0, dtype=torch.bool)
1581
+
1582
+ confidence = torch.where(active_block_mask, x0_p, -torch.inf)
1583
+ high_conf_mask = confidence[0] > threshold
1584
+ num_high_confidence = high_conf_mask.sum().item()
1585
+
1586
+ if num_high_confidence >= num_to_transfer:
1587
+ transfer_index[0] = high_conf_mask
1588
+ else:
1589
+ _, idx = torch.topk(
1590
+ confidence[0],
1591
+ k=min(num_to_transfer, active_block_mask.sum().item()),
1592
+ )
1593
+ transfer_index[0, idx] = True
1594
+
1595
+ if transfer_index.any():
1596
+ cur_x[:, -block_length:][transfer_index] = x0[transfer_index]
1597
+ if eos_early_stop and (x0[transfer_index] == eos_id).any():
1598
+ eos_pos_in_x = (cur_x[0] == eos_id).nonzero(as_tuple=True)
1599
+ if len(eos_pos_in_x[0]) > 0:
1600
+ eos_pos = eos_pos_in_x[0][0].item()
1601
+ if (cur_x[0, prompt_length:eos_pos] != mask_id).all():
1602
+ final_x = x[:, prompt_length : eos_pos + 1]
1603
+ return final_x
1604
+
1605
+ x[:, :current_window_end] = cur_x
1606
+ if (
1607
+ eos_id is not None
1608
+ and (x[0, prompt_length:current_window_end] == eos_id).any()
1609
+ ):
1610
+ break
1611
+
1612
+ generated_answer = x[:, : prompt_length + gen_length]
1613
+
1614
+ mask_positions = (generated_answer[0][input_ids.shape[1] :] == eos_id).nonzero(
1615
+ as_tuple=True
1616
+ )[0]
1617
+ if len(mask_positions) > 0:
1618
+ first_mask_position = mask_positions[0].item()
1619
+ else:
1620
+ first_mask_position = gen_length
1621
+ return generated_answer[:, input_ids.shape[1] : input_ids.shape[1] + first_mask_position + 1]
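Note on `generate` above: two pieces of the loop are easy to misread in diff form. One is the block-wise causal mask, where a position may attend to every position in its own block and in earlier blocks. The other is the rule that decides which masked positions are filled in at each step. The sketch below restates both in plain Python on small lists. It is a reading aid under stated assumptions, not the model's implementation: `block_causal_mask` and `select_transfer` are hypothetical names, booleans stand in for the `0.0` / `-inf` mask values, and a single sequence is assumed, as in the diff.

```python
from typing import List


def block_causal_mask(num_blocks: int, block_length: int) -> List[List[bool]]:
    """True where query position q may attend to key position k."""
    n = num_blocks * block_length
    return [
        [(k // block_length) <= (q // block_length) for k in range(n)]
        for q in range(n)
    ]


def select_transfer(
    confidence: List[float],
    is_masked: List[bool],
    num_to_transfer: int,
    threshold: float,
) -> List[int]:
    """Indices in the active block whose sampled tokens are accepted."""
    candidates = [i for i, masked in enumerate(is_masked) if masked]
    high = [i for i in candidates if confidence[i] > threshold]
    if len(high) >= num_to_transfer:
        # Enough confident positions: accept all of them at once.
        return high
    # Otherwise fall back to the most confident masked positions.
    k = min(num_to_transfer, len(candidates))
    ranked = sorted(candidates, key=lambda i: confidence[i], reverse=True)
    return sorted(ranked[:k])


for row in block_causal_mask(num_blocks=2, block_length=2):
    print(row)
print(select_transfer([0.99, 0.5, 0.97, 0.2], [True, True, True, False], 1, 0.95))
print(select_transfer([0.6, 0.5, 0.9, 0.99], [True, True, True, False], 2, 0.95))
```

In the first call two masked positions exceed the threshold, so both are accepted even though the schedule asked for one. This is how a block can finish in fewer than `steps` iterations. In the second call nothing exceeds the threshold, so the two most confident masked positions are taken instead.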
special_tokens_map.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "bos_token": "<|startoftext|>",
3
+ "cls_token": "[CLS]",
4
+ "eos_token": "<|endoftext|>",
5
+ "gmask_token": "[gMASK]",
6
+ "pad_token": "<|endoftext|>",
7
+ "mask_token": "<|mask|>"
8
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff