arxiv:2510.23052

Knocking-Heads Attention

Published on Oct 27 · Submitted by Zhanchao Zhou on Oct 28

Abstract

AI-generated summary: Knocking-heads attention (KHA) enhances multi-head attention by enabling cross-head interactions, improving training dynamics and performance in large language models.

Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads inherently weakens individual head capacity, and existing attention mechanisms - whether standard MHA or its variants like grouped-query attention (GQA) and grouped-tied attention (GTA) - simply concatenate outputs from isolated heads without strong interaction. To address this limitation, we propose knocking-heads attention (KHA), which enables attention heads to "knock" on each other - facilitating cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally-initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only minimal parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared to baseline attention mechanisms, KHA brings superior and more stable training dynamics, achieving better performance across downstream tasks.
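
The abstract describes the mechanism only at a high level, so below is a minimal PyTorch sketch of one plausible reading, not the authors' implementation: the per-head Q, K, and V features are mixed across the head axis by a single shared matrix initialized to the identity (diagonal), so the block starts out as plain multi-head attention and can learn cross-head interactions during training. The class name `KnockingHeadsAttention`, the placement of the mixing, and all shapes are illustrative assumptions.

```python
# Hypothetical sketch of the idea described in the abstract (not the paper's code):
# mix per-head Q/K/V features across heads with a shared, identity-initialized
# matrix before scaled dot-product attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnockingHeadsAttention(nn.Module):  # illustrative name, assumed here
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h = num_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # Shared "knocking" matrix over the head axis, diagonally (identity)
        # initialized so each head is initially independent of the others.
        self.knock = nn.Parameter(torch.eye(num_heads))

    def _split(self, x, proj):
        b, t, _ = x.shape
        return proj(x).view(b, t, self.h, self.d_head)  # (B, T, H, Dh)

    def forward(self, x, causal: bool = True):
        q, k, v = (self._split(x, p) for p in (self.q_proj, self.k_proj, self.v_proj))
        # Cross-head feature-level interaction: mix the head axis with the shared
        # matrix before attention (identity at init => standard MHA behavior).
        q, k, v = (torch.einsum("bthd,hg->btgd", z, self.knock) for z in (q, k, v))
        q, k, v = (z.transpose(1, 2) for z in (q, k, v))  # (B, H, T, Dh)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
        out = out.transpose(1, 2).reshape(x.shape)
        return self.o_proj(out)


# Quick smoke test with assumed dimensions.
attn = KnockingHeadsAttention(d_model=256, num_heads=8)
y = attn(torch.randn(2, 16, 256))
print(y.shape)  # torch.Size([2, 16, 256])
```

With the identity initialization the block is functionally equivalent to standard MHA at the first step. Whether KHA applies one shared matrix jointly to Q, K, and V, at what granularity the mixing operates, and how it composes with GQA or GTA variants is not specified in the abstract and is assumed here.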

Community

Paper submitter

Knocking-Heads Attention (KHA) enables attention heads to "knock" and interact with each other before computing attention, addressing the limitation that traditional multi-head attention processes heads independently and simply concatenates outputs. Using a diagonally-initialized shared projection matrix, KHA facilitates cross-head feature interactions while preserving head specialization, achieving superior performance and training stability with minimal additional parameters and computation.
