FlashAttention optimises the execution of scaled dot-product attention (SDPA) by fusing its component operations (the QKᵀ matrix multiply, the softmax, and the weighted sum over V) into a single GPU kernel, avoiding repeated reads and writes of the large intermediate attention matrix to high-bandwidth memory.
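To make the fused steps concrete, here is a minimal, unfused SDPA sketch in NumPy (illustrative only; shapes and the single-head setting are assumptions, and real implementations operate on batched, multi-head tensors). FlashAttention computes the same result without ever materialising the full `(N, N)` score matrix:

```python
import numpy as np

def sdpa(q, k, v):
    # Scaled dot-product attention written as separate steps.
    # FlashAttention fuses these steps into one kernel so the
    # full (N, N) score matrix never hits off-chip memory.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # (N, N) materialised here
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # (N, d)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
out = sdpa(q, k, v)  # shape (8, 4)
```

The `(N, N)` intermediate is exactly the memory traffic FlashAttention eliminates by computing the softmax in tiles with a running normaliser.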

FlashAttention makes longer context lengths practical during training and longer sequence lengths practical during generative inference, since its memory footprint grows linearly with sequence length rather than quadratically.
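A back-of-envelope calculation shows why the quadratic score matrix becomes the bottleneck at long context (fp16 storage and a single head are assumed here for illustration):

```python
# The attention score matrix has N*N entries for sequence length N.
# At fp16 (2 bytes per entry), one head's score matrix costs:
for n in (2048, 8192, 32768):
    gib = n * n * 2 / 2**30
    print(f"N={n:>6}: {gib:.3f} GiB")
# At N=32768 a single head's score matrix alone is 2 GiB,
# before multiplying by heads, layers, and batch size.
```

FlashAttention never allocates this matrix, which is what lets sequence length scale without the quadratic memory cost.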

FlashAttention-2 iterates on FlashAttention, improving its performance through better parallelism and work partitioning across thread blocks and warps, and by reducing the number of non-matmul FLOPs.
