DeepSeek v3.2

#29

by Diene10 - opened 5 days ago

Discussion

Diene10

5 days ago

October 26

wu153

4 days ago

Why does the model file first compute the full attention scores, scores = (torch.einsum("bshc,btc->bsht", q_nope.float(), self.kv_cache[:bsz, :end_pos].float()) + torch.einsum("bshr,btr->bsht", q_pe.float(), self.pe_cache[:bsz, :end_pos].float())) * self.softmax_scale, and only afterwards run the sparse indexing, topk_indices = self.indexer(x, qr, start_pos, freqs_cis, mask)? Isn't that redundant computation?
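To make the question concrete, here is a minimal sketch of the two orderings being contrasted. It is not the repository's code: the function names `dense_then_mask` and `gather_then_score` are hypothetical, the random indices stand in for whatever `self.indexer` returns, and the causal mask and the separate `q_pe` / `pe_cache` term are left out for brevity. It only illustrates that scoring every cached position and then masking gives the same weights on the selected positions as gathering the selected keys first, while doing work proportional to the full cache length rather than to k.

```python
import torch

def dense_then_mask(q, k, topk_indices, scale):
    """Score every cached position, then mask out the non-selected ones.

    q: (b, s, h, c) queries, k: (b, t, c) cached keys,
    topk_indices: (b, s, kk) int64 positions chosen per query.
    """
    scores = torch.einsum("bshc,btc->bsht", q, k) * scale  # work over all t
    # -inf everywhere except the selected positions, which get 0.
    index_mask = torch.full((q.size(0), q.size(1), k.size(1)), float("-inf"))
    index_mask.scatter_(-1, topk_indices, 0.0)
    scores = scores + index_mask.unsqueeze(2)  # broadcast over heads
    return scores.softmax(dim=-1)              # (b, s, h, t)

def gather_then_score(q, k, topk_indices, scale):
    """Gather the selected keys first, then score only those."""
    b = k.size(0)
    batch_idx = torch.arange(b).view(b, 1, 1)
    k_sel = k[batch_idx, topk_indices]          # (b, s, kk, c)
    scores = torch.einsum("bshc,bskc->bshk", q, k_sel) * scale
    return scores.softmax(dim=-1)               # (b, s, h, kk)

if __name__ == "__main__":
    torch.manual_seed(0)
    b, s, h, c, t, kk = 2, 3, 4, 8, 5, 2
    q, k = torch.randn(b, s, h, c), torch.randn(b, t, c)
    # Stand-in for the indexer output: kk distinct positions per query.
    idx = torch.rand(b, s, t).topk(kk).indices
    dense = dense_then_mask(q, k, idx, scale=c ** -0.5)
    sparse = gather_then_score(q, k, idx, scale=c ** -0.5)
    picked = torch.gather(dense, -1, idx.unsqueeze(2).expand(-1, -1, h, -1))
    print(torch.allclose(picked, sparse, atol=1e-6))
```

In this sketch the dense variant is simpler to write with plain tensor ops, and the gathered variant is the one that actually skips work; whether the gap matters in practice depends on the kernels used, which this example does not attempt to model.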
