Scaled dot-product attention is an attention mechanism where the dot products are scaled down by $\sqrt{d_k}$. Formally, we have a query $Q$, a key $K$ and a value $V$ and calculate the attention as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
If we assume that $q$ and $k$ are $d_k$-dimensional vectors whose components are independent random variables with mean $0$ and variance $1$, then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, has mean $0$ and variance $d_k$. Since we would prefer these values to have variance $1$, we divide by $\sqrt{d_k}$. We then apply softmax to the attention scores so that they are transformed into a set of positive, normalised weights that sum to 1.
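As a minimal sketch of the computation above (using NumPy, with illustrative names; masking, batching and multiple heads are omitted):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for arrays of shape
    (num_queries, d_k), (num_keys, d_k) and (num_keys, d_v)."""
    d_k = Q.shape[-1]
    # Scale the dot products by sqrt(d_k) so their variance stays near 1.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key axis: positive weights that sum to 1 per query.
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted sum of the value vectors.
    return weights @ V

# Example usage with arbitrary dimensions:
Q = np.random.randn(4, 64)    # 4 queries,  d_k = 64
K = np.random.randn(10, 64)   # 10 keys,    d_k = 64
V = np.random.randn(10, 32)   # 10 values,  d_v = 32
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 32)
```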
