Grouped-query attention (GQA) interpolates between multi-query attention (MQA) and multi-head attention (MHA): the query heads are divided into groups, and each group shares a single key/value head. This achieves quality close to MHA at speed comparable to MQA (Ainslie et al., 2023, "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints").
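
To make the interpolation concrete, here is a minimal PyTorch sketch of a GQA layer. It is an illustrative implementation, not the paper's code: the class name `GroupedQueryAttention`, the constructor parameters (`d_model`, `num_heads`, `num_kv_heads`), and the use of `repeat_interleave` to broadcast shared key/value heads are all choices made for this example. Setting `num_kv_heads == num_heads` recovers MHA, and `num_kv_heads == 1` recovers MQA.

```python
import math
import torch
import torch.nn as nn


class GroupedQueryAttention(nn.Module):
    """Illustrative GQA sketch: num_kv_heads key/value heads are shared
    across num_heads query heads. num_kv_heads == num_heads gives MHA;
    num_kv_heads == 1 gives MQA. Names/params are this example's choices."""

    def __init__(self, d_model: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0, "query heads must split evenly into groups"
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = d_model // num_heads
        # Full-width query projection, but narrower K/V projections:
        # only num_kv_heads heads' worth of keys and values are computed.
        self.q_proj = nn.Linear(d_model, num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # Project and split into heads: (batch, heads, seq, head_dim).
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads attends over one shared K/V head;
        # repeat K/V along the head axis so shapes match the queries.
        group_size = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)
        # Standard scaled dot-product attention per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        out = scores.softmax(dim=-1) @ v          # (batch, heads, seq, head_dim)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out)


# Usage: 8 query heads sharing 2 K/V heads (4 query heads per group).
x = torch.randn(2, 16, 512)
gqa = GroupedQueryAttention(d_model=512, num_heads=8, num_kv_heads=2)
y = gqa(x)  # shape: (2, 16, 512)
```

The speed benefit comes from the smaller key/value tensors: with 2 K/V heads instead of 8, the KV cache shrinks by 4x during autoregressive decoding, while retaining more query-head diversity than MQA's single shared K/V head.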