DeepSeek v3.2

#29
by Diene10 - opened

Oct 26

Why does the model file first compute the full attention scores, scores = (torch.einsum("bshc,btc->bsht", q_nope.float(), self.kv_cache[:bsz, :end_pos].float()) + torch.einsum("bshr,btr->bsht", q_pe.float(), self.pe_cache[:bsz, :end_pos].float())) * self.softmax_scale, and only afterwards run the sparse indexing, topk_indices = self.indexer(x, qr, start_pos, freqs_cis, mask)? Isn't that redundant computation?
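To make the concern concrete, here is a minimal sketch of the two orderings being compared. It is a toy version, not code from the model file: the function names dense_then_mask and gather_then_score are hypothetical, the shapes are made up, random indices stand in for the output of self.indexer, and it assumes the top-k indices are applied to the dense scores as an additive -inf mask.

```python
import torch


def dense_then_mask(q, k, topk_indices, scale):
    """Score every cached position, then mask out the non-selected ones.

    q: (b, s, h, c) queries, k: (b, t, c) cached keys,
    topk_indices: (b, s, n) selected positions per query token.
    """
    b, s, _, _ = q.shape
    t = k.size(1)
    scores = torch.einsum("bshc,btc->bsht", q, k) * scale  # cost grows with t
    # Assumption: selection is applied as an additive -inf mask.
    mask = torch.full((b, s, t), float("-inf"))
    mask.scatter_(-1, topk_indices, 0.0)
    return (scores + mask.unsqueeze(2)).softmax(dim=-1)


def gather_then_score(q, k, topk_indices, scale):
    """Gather the selected keys first, then score only those."""
    b, s, h, _ = q.shape
    t = k.size(1)
    batch_idx = torch.arange(b).view(b, 1, 1)
    k_sel = k[batch_idx, topk_indices]  # (b, s, n, c)
    scores = torch.einsum("bshc,bsnc->bshn", q, k_sel) * scale  # cost grows with n
    probs = scores.softmax(dim=-1)
    # Scatter back to a dense (b, s, h, t) layout only so the results can be compared.
    idx = topk_indices.unsqueeze(2).expand(-1, -1, h, -1).contiguous()
    return torch.zeros(b, s, h, t).scatter_(-1, idx, probs)


if __name__ == "__main__":
    torch.manual_seed(0)
    b, s, h, c, t, n = 2, 3, 4, 8, 16, 5
    q, k = torch.randn(b, s, h, c), torch.randn(b, t, c)
    # Stand-in for the indexer: n distinct random positions per query token.
    topk_indices = torch.rand(b, s, t).topk(n, dim=-1).indices
    dense = dense_then_mask(q, k, topk_indices, c ** -0.5)
    sparse = gather_then_score(q, k, topk_indices, c ** -0.5)
    print(torch.allclose(dense, sparse, atol=1e-5))
```

If the two orderings really are equivalent under that assumption, the dense version does work proportional to end_pos for each query token while the gathered version does work proportional to the top-k size, which is why the dense computation looks redundant to me.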
