OpenMOSE/RWKV-Qwen3-30B-A3B-Instruct-hxa07d — Model Card
Model Summary
OpenMOSE/RWKV-Qwen3-30B-A3B-Instruct-hxa07d is a 30B-parameter instruction-tuned hybrid model built from Alibaba’s Qwen3-30B-A3B MoE Instruct (2507), converted into a new RWKV hxa07D + NoPEAttention architecture.
The core design goal is simple and aggressive:
Keep attention layers to the absolute minimum (3 layers total) and make the remaining 45 of the 48 layers linear-time RWKV, achieving a 93.75% KV-cache reduction compared to a full-attention 48-layer stack.
This model is the continuation of an iterative research direction discussed throughout earlier development threads: pushing practical long-context + high-throughput inference by combining RWKV-style O(n) recurrence with a small number of strategically placed attention layers, while relying on distillation (RADLADS) to preserve instruction-following quality and reasoning.
Key Facts
- Repository / Name: OpenMOSE/RWKV-Qwen3-30B-A3B-Instruct-hxa07d
- Base (converted from): Alibaba Qwen3-30B-A3B MoE Instruct (2507)
- Total parameters: 30B
- Total layers: 48
- RWKV layers: 45
- NoPEAttention layers: 3
- KV-cache reduction: 93.75% (vs. attention on all 48 layers); a back-of-the-envelope sketch follows this list
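To make the 93.75% figure concrete, here is a back-of-the-envelope comparison of the KV-cache footprint with attention in 3 layers versus all 48. The KV-head count, head dimension, and context length are illustrative assumptions, not values read from the model config.

```python
# Back-of-the-envelope KV-cache estimate; the per-layer numbers below are
# illustrative assumptions, not the model's actual configuration.
bytes_per_elem = 2           # fp16/bf16
num_kv_heads = 4             # assumed KV heads per attention layer
head_dim = 128               # assumed head dimension
context_len = 32768
batch_size = 1

def kv_cache_bytes(num_attn_layers: int) -> int:
    # K and V tensors per attention layer: 2 * batch * context * kv_heads * head_dim
    return 2 * batch_size * context_len * num_kv_heads * head_dim * bytes_per_elem * num_attn_layers

print(f"full-attention KV cache (48 layers): {kv_cache_bytes(48) / 2**30:.2f} GiB")
print(f"hybrid KV cache (3 layers)         : {kv_cache_bytes(3) / 2**30:.2f} GiB")
print(f"reduction                          : {1 - 3 / 48:.2%}")  # 93.75%
```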
Why This Model Exists
This model is built for the same “local-first high performance” direction repeatedly emphasized in prior design discussions:
- Make high-end models accessible locally, not only in large-server environments.
- Avoid the KV-cache cost wall that typically dominates long-context inference.
- Keep a minimal amount of attention only where it matters, and let RWKV do the heavy lifting with linear-time compute.
In earlier work, we explored:
- hybridizing Transformer blocks with RWKV TimeMix variants,
- making RWKV more “teacher-aligned” via GQA-like structure and weight inheritance where helpful,
- tuning stability, precision behavior, and inference throughput through kernel-level optimization and careful distillation staging.
This release consolidates that direction into a clean architecture statement: three NoPEAttention layers are enough; the rest should be fast.
Architecture
Overview: RWKV hxa07D + NoPEAttention Hybrid
The network is a 48-layer stack:
- 45 layers: RWKV hxa07D (linear-time recurrence)
- 3 layers: NoPEAttention (attention without positional embeddings)
This architecture intentionally limits attention depth to reduce memory pressure and keep inference scalable.
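As an illustration of how such a schedule splits the stack, the snippet below marks which layers would hold a KV cache; the indices of the three attention layers are a hypothetical placement, not read from the released checkpoint.

```python
# Hypothetical layer schedule for the 48-layer hybrid; the attention-layer
# indices are an illustrative assumption, not the model's actual placement.
NUM_LAYERS = 48
ATTENTION_LAYER_IDS = {15, 31, 47}  # hypothetical positions of the 3 NoPEAttention layers

schedule = [
    "nope_attention" if i in ATTENTION_LAYER_IDS else "rwkv_hxa07d"
    for i in range(NUM_LAYERS)
]

kv_cached = [i for i, kind in enumerate(schedule) if kind == "nope_attention"]
print(f"{len(kv_cached)}/{NUM_LAYERS} layers hold a KV cache: {kv_cached}")
print(f"{schedule.count('rwkv_hxa07d')} layers run as linear-time RWKV recurrence")
```

Only the layers in `kv_cached` grow memory with context length; every RWKV layer keeps a fixed-size recurrent state regardless of sequence length.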
hxa07D RWKV Highlights
hxa07D builds on the RWKV v7 lineage and adds stability and retention improvements:
- Improved RWKV v7-based core
- k,v residual connections: helps preserve information flow and makes student-teacher transfer smoother.
- Big Decay (higher “forgetting precision”): a retention-focused decay design aimed at improving long-range stability without relying on KV cache. A much-simplified recurrence sketch follows this list.
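To see why a decayed recurrence can stand in for a KV cache, the toy sketch below uses a generic linear-attention-style state update: the state has a fixed size, and a per-channel decay controls how long information is retained. This is deliberately simplified and is not the actual hxa07D kernel; learned projections and the k,v residual wiring are omitted.

```python
import torch

# Generic decayed linear-recurrence sketch (NOT the actual hxa07D kernel).
# The state is a fixed-size matrix, so memory does not grow with sequence
# length; the per-channel decay governs retention ("Big Decay" pushes this
# toward keeping information longer).
d = 8                                    # toy head dimension
state = torch.zeros(d, d)                # fixed-size recurrent state
decay = torch.sigmoid(torch.randn(d))    # per-channel decay in (0, 1)

tokens = torch.randn(16, d)              # 16 toy token embeddings
outputs = []
for x in tokens:
    k, v, q = x, x, x                    # real models use learned k/v/q projections
    state = decay.unsqueeze(1) * state + torch.outer(k, v)  # decay old info, write new
    outputs.append(q @ state)            # read out with the query

y = torch.stack(outputs)                 # (16, d); state stayed (d, d) throughout
```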
NoPEAttention (3 layers)
- NoPEAttention layers are included to retain selective global interaction capability.
- Because only 3 of the 48 layers use an attention-style KV cache, overall KV-cache usage is drastically reduced. A minimal attention sketch follows this list.
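For orientation, NoPE-style attention is ordinary causal scaled-dot-product attention in which no rotary or absolute positional encoding is applied to the queries and keys. The sketch below uses made-up dimensions and is purely illustrative, not the model's implementation.

```python
import torch
import torch.nn.functional as F

# Minimal NoPE-style attention sketch: standard causal attention with no
# positional encoding applied to q or k. Dimensions are made up.
B, H, T, D = 1, 4, 32, 64                # batch, heads, tokens, head dim
q = torch.randn(B, H, T, D)              # note: no RoPE applied to q/k
k = torch.randn(B, H, T, D)
v = torch.randn(B, H, T, D)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (B, H, T, D)
```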
Distillation Method: RADLADS (SmerkyG)
The teacher → student conversion and training pipeline is based on RADLADS, proposed by SmerkyG.
While exact staging and loss composition can vary by run, the guiding principle remains:
- Make the student behave like the teacher (logits and/or hidden dynamics),
- while the architecture is intentionally different (RWKV-heavy with minimal attention),
- and maintain instruction-following quality under the new inference constraints. A schematic loss sketch follows this list.
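As a schematic of what "behave like the teacher" typically means, the sketch below combines a KL term on logits with an optional hidden-state alignment term. This is a generic distillation sketch under assumed loss weights, not the exact RADLADS staging or recipe.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden=None, teacher_hidden=None,
                      temperature=1.0, hidden_weight=1.0):
    """Generic logit + hidden-state distillation sketch (not the exact RADLADS recipe)."""
    # KL divergence between teacher and student next-token distributions
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    loss = F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature ** 2

    # Optional alignment of hidden dynamics (e.g., matched per-layer hidden states)
    if student_hidden is not None and teacher_hidden is not None:
        loss = loss + hidden_weight * F.mse_loss(student_hidden, teacher_hidden)
    return loss
```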
Performance
Despite restricting NoPEAttention to only 3 layers, the model is designed to maintain strong instruction performance through distillation and architectural alignment techniques.
| Category | Benchmark | Score | Notes |
|---|---|---|---|
| Reasoning | MMLU | 75.70% | |
| Math | GSM8K | 81.58% | |
| Long Context | PassKey / Needle | 85k | KV cache advantage should show here |
Intended Use
Best for
- Local inference where KV-cache memory is the limiting factor
- Longer contexts under constrained VRAM (relative to full-attention 30B-class models)
- High-throughput decoding workloads (chat, agents, batch inference)
Not specifically optimized for
- Tasks that require full deep attention at every layer (some niche reasoning patterns may benefit from more attention depth)
- Safety-critical domains without additional alignment and evaluation (medical/legal/financial advice)
Limitations & Known Considerations
- Hybrid trade-off: With only 3 attention layers, some behaviors that emerge from deep attention stacks may differ.
- Distillation dependence: Final quality is strongly tied to the distillation recipe and data mixture.
- Long-context behavior: Big Decay and minimal KV-cache can improve practicality, but long-context quality should be validated with dedicated tests (PassKey / Needle-style); a minimal harness sketch follows this list.
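A minimal PassKey-style harness along the lines below can serve as that validation; the prompt format, token-budget heuristic, and `generate` placeholder are illustrative, not an official evaluation.

```python
import random

def make_passkey_prompt(context_tokens_approx=8000):
    """Hide a random passkey inside filler text and ask the model to retrieve it."""
    passkey = str(random.randint(10000, 99999))
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    haystack = filler * (context_tokens_approx // 12)  # rough ~12-tokens-per-block budget
    insert_at = random.randint(0, len(haystack))
    needle = f" The pass key is {passkey}. Remember it. "
    prompt = (haystack[:insert_at] + needle + haystack[insert_at:]
              + "\nWhat is the pass key? Answer with the number only.")
    return prompt, passkey

prompt, passkey = make_passkey_prompt()
# response = generate(prompt)   # plug in your actual inference call here
# print(passkey in response)    # simple pass/fail retrieval check
```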
Bias, Safety, and Responsible Use
This model inherits typical biases and failure modes from large-scale web-trained instruction models. Use standard best practices:
- Add system policies and tool constraints for agentic use
- Avoid over-trusting outputs in high-stakes situations
- Evaluate on your target languages/domains and apply additional alignment if needed
How to Use
Quick tips
- If your runtime supports it, prefer settings optimized for RWKV-heavy decoding (kernel-optimized recurrent path).
- Expect significantly reduced KV-cache VRAM needs versus full-attention equivalents.
Example (pseudo-code)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "OpenMOSE/RWKV-Qwen3-30B-A3B-Instruct-hxa07d"

# The hybrid RWKV/NoPEAttention blocks ship as custom modeling code,
# hence trust_remote_code=True.
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Explain the key idea of limiting attention layers to 3 in a RWKV hybrid."
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))
```
(Adjust to your actual runtime / custom loader depending on how the hybrid is implemented.)
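For chat-style prompting, the tokenizer's chat template can be applied through the standard transformers pattern below. This assumes the repository ships a chat template (as the Qwen3 instruct lineage does) and reuses `tok` and `model` from the example above.

```python
# Chat-style usage via the tokenizer's chat template (generic transformers
# pattern; assumes a chat template is shipped with the repository).
messages = [
    {"role": "user", "content": "Summarize why only 3 of 48 layers use attention."},
]
input_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(input_ids, max_new_tokens=256)
print(tok.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```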
Acknowledgements
This architecture research and implementation was made possible with computing power and technical support from Recursal AI. We sincerely thank them for enabling this work.
Distillation methodology is based on RADLADS, proposed by SmerkyG, whose ideas significantly influenced the training pipeline.
Citation
If you use this model in research or products, please cite the model repository and credit the contributors and supporting organizations.
2025 OpenMOSE