DeepSeek-V2 improves upon the DeepSeekMoE architecture by introducing MLA (Multi-head Latent Attention), an attention mechanism equipped with low-rank key-value joint compression. Empirically, MLA achieves superior performance compared with MHA, while significantly reducing the KV cache during inference, thereby boosting inference efficiency.

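The core idea behind this reduction can be illustrated with a short sketch: rather than caching the full per-head keys and values for every token, only a low-rank joint compression of them (a latent vector of dimension $d_c$) is cached, and keys and values are reconstructed from it via up-projections. The snippet below is a minimal NumPy sketch under assumed toy dimensions; the projection names (W_dkv, W_uk, W_uv) are illustrative rather than the paper's notation, and the decoupled RoPE key and query compression of the full MLA design are omitted.

```python
import numpy as np

# Minimal sketch of low-rank joint key-value compression (the core idea of MLA).
# Dimensions below are illustrative, not DeepSeek-V2's actual configuration.
d_model = 1024        # hidden size
n_h, d_h = 8, 64      # attention heads and per-head dimension
d_c = 4 * d_h         # KV compression dimension (DeepSeek-V2 uses d_c = 4 * d_h)

rng = np.random.default_rng(0)
W_dkv = rng.standard_normal((d_model, d_c)) / np.sqrt(d_model)   # down-projection to the latent
W_uk  = rng.standard_normal((d_c, n_h * d_h)) / np.sqrt(d_c)     # up-projection to keys
W_uv  = rng.standard_normal((d_c, n_h * d_h)) / np.sqrt(d_c)     # up-projection to values

h = rng.standard_normal((16, d_model))        # hidden states for 16 tokens

# Only the compressed latent is cached per token: d_c elements instead of 2 * n_h * d_h.
c_kv = h @ W_dkv                              # (16, d_c) -- this is what goes into the KV cache
k = (c_kv @ W_uk).reshape(16, n_h, d_h)       # keys reconstructed from the latent
v = (c_kv @ W_uv).reshape(16, n_h, d_h)       # values reconstructed from the latent

print("cached elements per token:", d_c, "vs. MHA:", 2 * n_h * d_h)
```
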
Comparison of KV Cache

| Attention Mechanism | KV Cache per Token (# Element) | Capability |
| --- | --- | --- |
| MHA | $2 n_h d_h l$ | Strong |
| GQA | $2 n_g d_h l$ | Moderate |
| MQA | $2 d_h l$ | Weak |
| MLA | $(d_c + d_h^R) l \approx \frac{9}{2} d_h l$ | Stronger |

where $n_h$ denotes the number of attention heads, $d_h$ denotes the dimension per attention head, $l$ denotes the number of layers, $n_g$ denotes the number of groups in GQA, and $d_c$ and $d_h^R$ denote the KV compression dimension and the per-head dimension of the decoupled queries and key in MLA, respectively.

For DeepSeek-V2, $d_c$ is set to $4 d_h$ and $d_h^R$ is set to $\frac{d_h}{2}$. Its KV cache per token is therefore $(4 d_h + \frac{d_h}{2}) l = \frac{9}{2} d_h l$, which equals the cache of GQA with only 2.25 groups ($2 \times 2.25 \, d_h l$), while MLA achieves stronger performance than MHA.

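To make the comparison concrete, the following sketch evaluates the per-token cache formulas from the table above. The dimensions are illustrative assumptions chosen to be roughly in line with DeepSeek-V2's scale, and the GQA group count is a hypothetical baseline, not a setting from the paper.

```python
# Evaluate the KV-cache-per-token formulas from the table with illustrative dimensions.
n_h, d_h, l = 128, 128, 60        # heads, per-head dim, layers (assumed for illustration)
n_g = 8                           # hypothetical GQA group count for comparison
d_c, d_h_R = 4 * d_h, d_h // 2    # DeepSeek-V2 settings: d_c = 4*d_h, d_h^R = d_h/2

mha = 2 * n_h * d_h * l           # full keys and values for every head
gqa = 2 * n_g * d_h * l           # one key/value pair per group
mqa = 2 * d_h * l                 # a single shared key/value pair
mla = (d_c + d_h_R) * l           # compressed latent plus decoupled RoPE key

print(f"MHA: {mha:,}  GQA: {gqa:,}  MQA: {mqa:,}  MLA: {mla:,} elements/token")
print("MLA cache equals GQA with", mla / (2 * d_h * l), "groups")   # -> 2.25
```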