OpenMOSE/RWKV-Qwen3-30B-A3B-Instruct-hxa07d — Model Card
Model Summary
OpenMOSE/RWKV-Qwen3-30B-A3B-Instruct-hxa07d is a 30B-parameter instruction-tuned hybrid model built from Alibaba’s Qwen3-30B-A3B MoE Instruct (2507), converted into a new RWKV hxa07D + NoPEAttention architecture.
The core design goal is simple and aggressive:
Keep attention layers to the absolute minimum (3 layers total) and make the remaining 45 of the 48 layers linear-time RWKV, achieving a 93.75% KV-cache reduction compared to a full-attention 48-layer stack.
This model is the continuation of an iterative research direction discussed throughout earlier development threads: pushing practical long-context + high-throughput inference by combining RWKV-style O(n) recurrence with a small number of strategically placed attention layers, while relying on distillation (RADLADS) to preserve instruction-following quality and reasoning.
Key Facts
- Repository / Name: OpenMOSE/RWKV-Qwen3-30B-A3B-Instruct-hxa07d
- Base (converted from): Alibaba Qwen3-30B-A3B MoE Instruct (2507)
- Total parameters: 30B
- Total layers: 48
- RWKV layers: 45
- NoPEAttention layers: 3
- KV-cache reduction: 93.75% (vs. attention on all 48 layers); a back-of-the-envelope sketch follows this list
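To make the 93.75% figure concrete, here is a back-of-the-envelope comparison of the KV-cache footprint with attention in 3 layers versus all 48. The KV-head count, head dimension, and context length are illustrative assumptions, not values read from the model config.

```python
# Back-of-the-envelope KV-cache estimate; the per-layer numbers below are
# illustrative assumptions, not the model's actual configuration.
bytes_per_elem = 2           # fp16/bf16
num_kv_heads = 4             # assumed KV heads per attention layer
head_dim = 128               # assumed head dimension
context_len = 32768
batch_size = 1

def kv_cache_bytes(num_attn_layers: int) -> int:
    # K and V tensors per attention layer: 2 * batch * context * kv_heads * head_dim
    return 2 * batch_size * context_len * num_kv_heads * head_dim * bytes_per_elem * num_attn_layers

print(f"full-attention KV cache (48 layers): {kv_cache_bytes(48) / 2**30:.2f} GiB")
print(f"hybrid KV cache (3 layers)         : {kv_cache_bytes(3) / 2**30:.2f} GiB")
print(f"reduction                          : {1 - 3 / 48:.2%}")  # 93.75%
```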
Why This Model Exists
This model is built for the same “local-first high performance” direction repeatedly emphasized in prior design discussions:
- Make high-end models accessible locally, not only in large-server environments.
- Avoid the KV-cache cost wall that typically dominates long-context inference.
- Keep a minimal amount of attention only where it matters, and let RWKV do the heavy lifting with linear-time compute.
In earlier work, we explored:
- hybridizing Transformer blocks with RWKV TimeMix variants,
- making RWKV more “teacher-aligned” via GQA-like structure and weight inheritance where helpful,
- tuning stability, precision behavior, and inference throughput through kernel-level optimization and careful distillation staging.
This release consolidates that direction into a clean architecture statement: three NoPEAttention layers are enough; the rest should be fast.
Architecture
Overview: RWKV hxa07D + NoPEAttention Hybrid
The network is a 48-layer stack:
- 45 layers: RWKV hxa07D (linear-time recurrence)
- 3 layers: NoPEAttention (attention without positional embeddings)
This architecture intentionally limits attention depth to reduce memory pressure and keep inference scalable.
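As an illustration of how such a schedule splits the stack, the snippet below marks which layers would hold a KV cache; the indices of the three attention layers are a hypothetical placement, not read from the released checkpoint.

```python
# Hypothetical layer schedule for the 48-layer hybrid; the attention-layer
# indices are an illustrative assumption, not the model's actual placement.
NUM_LAYERS = 48
ATTENTION_LAYER_IDS = {15, 31, 47}  # hypothetical positions of the 3 NoPEAttention layers

schedule = [
    "nope_attention" if i in ATTENTION_LAYER_IDS else "rwkv_hxa07d"
    for i in range(NUM_LAYERS)
]

kv_cached = [i for i, kind in enumerate(schedule) if kind == "nope_attention"]
print(f"{len(kv_cached)}/{NUM_LAYERS} layers hold a KV cache: {kv_cached}")
print(f"{schedule.count('rwkv_hxa07d')} layers run as linear-time RWKV recurrence")
```

Only the layers in `kv_cached` grow memory with context length; every RWKV layer keeps a fixed-size recurrent state regardless of sequence length.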
hxa07D RWKV Highlights
hxa07D builds on the RWKV v7 lineage and adds stability and retention improvements:
- Improved RWKV v7-based core
- k,v residual connections: helps preserve information flow and makes student-teacher transfer smoother.
- Big Decay (higher “forgetting precision”): a retention-focused decay design aimed at improving long-range stability without relying on KV cache. A much-simplified recurrence sketch follows this list.
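To see why a decayed recurrence can stand in for a KV cache, the toy sketch below uses a generic linear-attention-style state update: the state has a fixed size, and a per-channel decay controls how long information is retained. This is deliberately simplified and is not the actual hxa07D kernel; learned projections and the k,v residual wiring are omitted.

```python
import torch

# Generic decayed linear-recurrence sketch (NOT the actual hxa07D kernel).
# The state is a fixed-size matrix, so memory does not grow with sequence
# length; the per-channel decay governs retention ("Big Decay" pushes this
# toward keeping information longer).
d = 8                                    # toy head dimension
state = torch.zeros(d, d)                # fixed-size recurrent state
decay = torch.sigmoid(torch.randn(d))    # per-channel decay in (0, 1)

tokens = torch.randn(16, d)              # 16 toy token embeddings
outputs = []
for x in tokens:
    k, v, q = x, x, x                    # real models use learned k/v/q projections
    state = decay.unsqueeze(1) * state + torch.outer(k, v)  # decay old info, write new
    outputs.append(q @ state)            # read out with the query

y = torch.stack(outputs)                 # (16, d); state stayed (d, d) throughout
```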
NoPEAttention (3 layers)
- NoPEAttention layers are included to retain selective global interaction capability.
- Because only 3 of the 48 layers use an attention-style KV cache, overall KV-cache usage is drastically reduced. A minimal attention sketch follows this list.
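For orientation, NoPE-style attention is ordinary causal scaled-dot-product attention in which no rotary or absolute positional encoding is applied to the queries and keys. The sketch below uses made-up dimensions and is purely illustrative, not the model's implementation.

```python
import torch
import torch.nn.functional as F

# Minimal NoPE-style attention sketch: standard causal attention with no
# positional encoding applied to q or k. Dimensions are made up.
B, H, T, D = 1, 4, 32, 64                # batch, heads, tokens, head dim
q = torch.randn(B, H, T, D)              # note: no RoPE applied to q/k
k = torch.randn(B, H, T, D)
v = torch.randn(B, H, T, D)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (B, H, T, D)
```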
Distillation Method: RADLADS (SmerkyG)
The teacher → student conversion and training pipeline is based on RADLADS, proposed by SmerkyG.
While exact staging and loss composition can vary by run, the guiding principle remains:
- Make the student behave like the teacher (logits and/or hidden dynamics),
- while the architecture is intentionally different (RWKV-heavy with minimal attention),
- and maintain instruction-following quality under the new inference constraints. A schematic loss sketch follows this list.
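As a schematic of what "behave like the teacher" typically means, the sketch below combines a KL term on logits with an optional hidden-state alignment term. This is a generic distillation sketch under assumed loss weights, not the exact RADLADS staging or recipe.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden=None, teacher_hidden=None,
                      temperature=1.0, hidden_weight=1.0):
    """Generic logit + hidden-state distillation sketch (not the exact RADLADS recipe)."""
    # KL divergence between teacher and student next-token distributions
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    loss = F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature ** 2

    # Optional alignment of hidden dynamics (e.g., matched per-layer hidden states)
    if student_hidden is not None and teacher_hidden is not None:
        loss = loss + hidden_weight * F.mse_loss(student_hidden, teacher_hidden)
    return loss
```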
Performance
Despite restricting NoPEAttention to only 3 layers, the model is designed to maintain strong instruction performance through distillation and architectural alignment techniques.
| Category | Benchmark | Score | Notes |
|---|---|---|---|
| Reasoning | MMLU | 75.70% | |
| Math | GSM8K | 81.58% | |
| Long Context | PassKey / Needle | 85k | KV cache advantage should show here |
Intended Use
Best for
- Local inference where KV-cache memory is the limiting factor
- Longer contexts under constrained VRAM (relative to full-attention 30B-class models)
- High-throughput decoding workloads (chat, agents, batch inference)
Not specifically optimized for
- Tasks that require full deep attention at every layer (some niche reasoning patterns may benefit from more attention depth)
- Safety-critical domains without additional alignment and evaluation (medical/legal/financial advice)
Limitations & Known Considerations
- Hybrid trade-off: With only 3 attention layers, some behaviors that emerge from deep attention stacks may differ.
- Distillation dependence: Final quality is strongly tied to the distillation recipe and data mixture.
- Long-context behavior: Big Decay and minimal KV-cache can improve practicality, but long-context quality should be validated with dedicated tests (PassKey / Needle-style); a minimal harness sketch follows this list.
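A minimal PassKey-style harness along the lines below can serve as that validation; the prompt format, token-budget heuristic, and `generate` placeholder are illustrative, not an official evaluation.

```python
import random

def make_passkey_prompt(context_tokens_approx=8000):
    """Hide a random passkey inside filler text and ask the model to retrieve it."""
    passkey = str(random.randint(10000, 99999))
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    haystack = filler * (context_tokens_approx // 12)  # rough ~12-tokens-per-block budget
    insert_at = random.randint(0, len(haystack))
    needle = f" The pass key is {passkey}. Remember it. "
    prompt = (haystack[:insert_at] + needle + haystack[insert_at:]
              + "\nWhat is the pass key? Answer with the number only.")
    return prompt, passkey

prompt, passkey = make_passkey_prompt()
# response = generate(prompt)   # plug in your actual inference call here
# print(passkey in response)    # simple pass/fail retrieval check
```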
Bias, Safety, and Responsible Use
This model inherits typical biases and failure modes from large-scale web-trained instruction models. Use standard best practices:
- Add system policies and tool constraints for agentic use
- Avoid over-trusting outputs in high-stakes situations
- Evaluate on your target languages/domains and apply additional alignment if needed
How to Use
Quick tips
- If your runtime supports it, prefer settings optimized for RWKV-heavy decoding (kernel-optimized recurrent path).
- Expect significantly reduced KV-cache VRAM needs versus full-attention equivalents.
Example (pseudo-code)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "OpenMOSE/RWKV-Qwen3-30B-A3B-Instruct-hxa07d"

# The hybrid RWKV/NoPEAttention blocks ship as custom modeling code,
# hence trust_remote_code=True.
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Explain the key idea of limiting attention layers to 3 in a RWKV hybrid."
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))
```
(Adjust to your actual runtime / custom loader depending on how the hybrid is implemented.)
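For chat-style prompting, the tokenizer's chat template can be applied through the standard transformers pattern below. This assumes the repository ships a chat template (as the Qwen3 instruct lineage does) and reuses `tok` and `model` from the example above.

```python
# Chat-style usage via the tokenizer's chat template (generic transformers
# pattern; assumes a chat template is shipped with the repository).
messages = [
    {"role": "user", "content": "Summarize why only 3 of 48 layers use attention."},
]
input_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(input_ids, max_new_tokens=256)
print(tok.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```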
Acknowledgements
This architecture research and implementation was made possible with computing power and technical support from Recursal AI. We sincerely thank them for enabling this work.
Distillation methodology is based on RADLADS, proposed by SmerkyG, whose ideas significantly influenced the training pipeline.
Citation
If you use this model in research or products, please cite the model repository and credit the contributors and supporting organizations.
2025 OpenMOSE