The use of kv-cache with MHA at inference time becomes the bottleneck for efficiency due to its large size.
![[SDPA#a383c1]]
![[MHA#2094e9]]
MLA, equipped with low-rank key-value joint compression, achieves better performance than MHA while requiring a significantly smaller KV cache. Its core can be expressed as:
$$
\begin{aligned}
\mathbf{c}_t^{KV} &= W^{DKV} \mathbf{h}_t, \\
\mathbf{k}_t^{C} &= W^{UK} \mathbf{c}_t^{KV}, \\
\mathbf{v}_t^{C} &= W^{UV} \mathbf{c}_t^{KV},
\end{aligned}
$$
where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values of token $t$; $d_c \ (\ll d_h n_h)$ denotes the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix; and $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively.
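The following is a minimal PyTorch sketch of the joint compression path. The names and dimensions (`d_model`, `n_heads`, `d_head`, `d_c`) are illustrative assumptions, not DeepSeek's actual implementation; the point is only that a single cached latent produces both keys and values.

```python
import torch
import torch.nn as nn

class LowRankKV(nn.Module):
    """Minimal sketch of MLA's low-rank key-value joint compression.

    Dimension names are illustrative, not DeepSeek's actual configuration.
    """
    def __init__(self, d_model: int, n_heads: int, d_head: int, d_c: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.w_dkv = nn.Linear(d_model, d_c, bias=False)          # W^{DKV}: down-projection
        self.w_uk = nn.Linear(d_c, n_heads * d_head, bias=False)  # W^{UK}: key up-projection
        self.w_uv = nn.Linear(d_c, n_heads * d_head, bias=False)  # W^{UV}: value up-projection

    def forward(self, h: torch.Tensor):
        # h: (batch, seq, d_model) hidden states
        c_kv = self.w_dkv(h)        # (batch, seq, d_c) -- the only tensor that must be cached
        k = self.w_uk(c_kv)         # (batch, seq, n_heads * d_head)
        v = self.w_uv(c_kv)         # (batch, seq, n_heads * d_head)
        b, t, _ = h.shape
        # reshape to per-head layout for attention
        k = k.view(b, t, self.n_heads, self.d_head)
        v = v.view(b, t, self.n_heads, self.d_head)
        return c_kv, k, v
```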
During inference, MLA only needs to cache $\mathbf{c}_t^{KV}$, so its KV cache holds only $d_c l$ elements per token, where $l$ denotes the number of layers. This low-rank compression is conceptually similar to LoRA, which applies the same idea to fine-tuning.
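To make the saving concrete, here is a back-of-the-envelope comparison of per-token, per-layer cache sizes. The dimensions are assumptions (roughly DeepSeek-V2-scale), not a quoted configuration.

```python
# Per-token KV-cache elements for one layer (illustrative dimensions).
n_h, d_h = 128, 128          # number of heads, head dimension (assumed values)
d_c = 4 * d_h                # KV compression dimension (assumed to be 4 * d_h)

mha_cache = 2 * n_h * d_h    # MHA caches full keys and values: 2 * n_h * d_h
mla_cache = d_c              # MLA caches only the latent c_t^{KV}: d_c

print(mha_cache, mla_cache, mha_cache / mla_cache)   # 32768 512 64.0
```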
In addition, during inference, since $W^{UK}$ can be absorbed into $W^{Q}$ and $W^{UV}$ can be absorbed into $W^{O}$, we do not even need to materialize the keys and values to compute attention.
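The sketch below illustrates the key-side absorption for a single head: precomputing $(W^{UQ})^{\top} W^{UK}$ once lets attention scores be taken directly against the cached latents, matching the naive path that materializes keys. All names and sizes are illustrative assumptions; the value-side absorption into $W^{O}$ works analogously.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (single head, for brevity): latent dims and head dim are assumptions.
d_c, d_cq, d_h, T = 16, 12, 8, 5

w_uk = torch.randn(d_h, d_c)     # W^{UK}: key up-projection
w_uq = torch.randn(d_h, d_cq)    # W^{UQ}: query up-projection

c_kv = torch.randn(T, d_c)       # cached latents c_s^{KV} for T past tokens
c_q = torch.randn(d_cq)          # latent c_t^{Q} of the current query token

# (a) Naive path: materialize the query and keys, then take dot products.
q = w_uq @ c_q                   # (d_h,)
k = c_kv @ w_uk.T                # (T, d_h)
scores_naive = k @ q             # (T,)

# (b) Absorbed path: precompute M = (W^{UQ})^T W^{UK} once, never materialize keys.
m = w_uq.T @ w_uk                      # (d_cq, d_c), computed once at load time
scores_absorbed = c_kv @ (m.T @ c_q)   # works directly on the cached latents

print(torch.allclose(scores_naive, scores_absorbed, atol=1e-5))  # True
```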

During training, low-rank compression can also be applied to the queries; it cannot reduce the KV cache, but it does reduce the activation memory during training:
$$
\begin{aligned}
\mathbf{c}_t^{Q} &= W^{DQ} \mathbf{h}_t, \\
\mathbf{q}_t^{C} &= W^{UQ} \mathbf{c}_t^{Q},
\end{aligned}
$$
where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries of token $t$; $d_c' \ (\ll d_h n_h)$ denotes the query compression dimension; $W^{DQ} \in \mathbb{R}^{d_c' \times d}$ and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ are the down-projection and up-projection matrices for queries, respectively.
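As a rough illustration of what the factorization buys on the query side, the snippet below compares the parameter count of a full query projection against the two-factor form. The dimensions are assumptions, not a quoted configuration.

```python
# Parameter count of a full query projection vs. the low-rank factorization
# (illustrative dimensions; the real configuration may differ).
d_model, n_h, d_h = 5120, 128, 128
d_c_q = 1536                                      # query compression dimension (assumed)

full_wq = d_model * n_h * d_h                     # W^{Q}: d x (n_h * d_h)
low_rank = d_model * d_c_q + d_c_q * n_h * d_h    # W^{DQ} plus W^{UQ}

print(full_wq, low_rank, low_rank / full_wq)      # ratio is roughly 0.39
```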
Decoupled Rotary Position Embedding
As popularized by LLaMA (and adopted by DeepSeek), RoPE is an effective position embedding. However, standard RoPE is incompatible with MLA's low-rank KV compression.
RoPE is position-sensitive for both keys and queries. If we apply RoPE to the keys ($\mathbf{k}_t^{C}$), the up-projection $W^{UK}$ becomes coupled with a position-sensitive RoPE matrix: the rotation of the currently generating token sits between $W^{Q}$ and $W^{UK}$, and since matrix multiplication does not commute, $W^{UK}$ can no longer be absorbed into the query projection during inference.
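A small numerical illustration of this coupling: if RoPE is applied after the key up-projection, the matrix one would like to precompute, $R(\text{pos})\,W^{UK}$, changes with position, so no single absorbed matrix exists. The dense rotation matrix and dimensions below are deliberately naive and purely illustrative.

```python
import torch

def rope_matrix(pos: int, d: int, base: float = 10000.0) -> torch.Tensor:
    """Dense RoPE rotation matrix for one position (block-diagonal 2x2 rotations).
    Written as an explicit matrix purely for illustration; real code rotates in place."""
    r = torch.zeros(d, d)
    for i in range(d // 2):
        theta = pos / (base ** (2 * i / d))
        c, s = torch.cos(torch.tensor(theta)), torch.sin(torch.tensor(theta))
        r[2 * i, 2 * i], r[2 * i, 2 * i + 1] = c, -s
        r[2 * i + 1, 2 * i], r[2 * i + 1, 2 * i + 1] = s, c
    return r

torch.manual_seed(0)
d_h, d_c = 8, 16                      # assumed head and latent dimensions
w_uk = torch.randn(d_h, d_c)          # W^{UK}

# With RoPE applied after the up-projection, the matrix that would have to be
# "absorbed" is R(pos) @ W^{UK} -- but it differs for every position, so there
# is no single matrix to precompute.
absorbed_pos1 = rope_matrix(1, d_h) @ w_uk
absorbed_pos2 = rope_matrix(2, d_h) @ w_uk
print(torch.allclose(absorbed_pos1, absorbed_pos2))   # False
```

The decoupled RoPE strategy addresses this by carrying the positional signal on extra per-head queries and one shared key that receive RoPE and are concatenated with the compressed, position-agnostic components, so the absorption trick remains valid.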
References