Multi-Query Attention (MQA)

14 Mar 2025 · 1 min read

Multi-head attention (MHA) consists of multiple attention heads run in parallel, each applying its own linear projections to the queries, keys, values, and outputs. Multi-query attention (MQA) is identical except that all heads share a single set of keys and values; only the query projections remain per-head.
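
Below is a minimal PyTorch sketch of this key/value sharing. The class and names (MultiQueryAttention, q_proj, k_proj, v_proj) are illustrative assumptions, not taken from the referenced paper.

```python
import torch
import torch.nn.functional as F
from torch import nn


class MultiQueryAttention(nn.Module):
    """Minimal multi-query attention: many query heads, one shared K/V head."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Queries keep one projection per head, as in MHA ...
        self.q_proj = nn.Linear(d_model, d_model)
        # ... but keys and values are projected to a single head shared by all query heads.
        self.k_proj = nn.Linear(d_model, self.d_head)
        self.v_proj = nn.Linear(d_model, self.d_head)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # Per-head queries: (b, n_heads, t, d_head)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Single K/V head: (b, 1, t, d_head), broadcast across all query heads
        k = self.k_proj(x).unsqueeze(1)
        v = self.v_proj(x).unsqueeze(1)
        scores = q @ k.transpose(-2, -1) / self.d_head**0.5  # (b, n_heads, t, t)
        attn = F.softmax(scores, dim=-1)
        out = attn @ v  # (b, n_heads, t, d_head)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)
    mqa = MultiQueryAttention(d_model=64, n_heads=8)
    print(mqa(x).shape)  # torch.Size([2, 16, 64])
```

Because only a single key/value head has to be cached during autoregressive decoding, the KV cache is roughly n_heads times smaller than in standard MHA, which is the memory-bandwidth saving motivating the paper.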

References

  • Papers With Code
  • Fast Transformer Decoding: One Write-Head is All You Need

Backlinks

  • DeepSeek-V2
  • Llama 2
  • Grouped-Query Attention (GQA)
