INT8 quantization meets FlashAttention to supercharge inference speed on Ampere GPUs
First-ever fully INT8 attention operator that's 72% faster than FP16
https://arxiv.org/abs/2409.16997
🤖 Original Problem:
Self-attention in LLMs scales quadratically in time and memory with sequence length. FlashAttention mitigates this by exploiting the GPU memory hierarchy, but existing quantization methods are not compatible with its workflow, especially on the widely used NVIDIA Ampere GPUs.
-----
🔧 Solution in this Paper:
INT-FlashAttention introduces a novel token-level quantization architecture that:
→ Keeps the Q, K, and V matrices entirely in INT8
→ Uses INT8 GEMM kernels for all matrix multiplications
→ Integrates seamlessly with FlashAttention's online softmax workflow
→ Employs token-level quantization for the Q and K matrices
→ Uses tensor-level quantization for the V matrix (see the sketch below)
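A minimal NumPy sketch of this quantization layout, assuming symmetric scaling to the INT8 range; the function names and example shapes are illustrative, not the paper's actual CUDA kernels:

```python
import numpy as np

def quantize_per_token(x):
    """Token-level (per-row) symmetric INT8 quantization: one scale per token, as used for Q and K."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_per_tensor(x):
    """Tensor-level symmetric INT8 quantization: a single scale for the whole matrix, as used for V."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Example: quantize Q, K token-wise and V tensor-wise
seq_len, head_dim = 8, 64
Q, K, V = (np.random.randn(seq_len, head_dim).astype(np.float32) for _ in range(3))
Q_int8, q_scale = quantize_per_token(Q)
K_int8, k_scale = quantize_per_token(K)
V_int8, v_scale = quantize_per_tensor(V)

# INT8 GEMM accumulates in INT32; dequantize scores with the outer product of per-token scales
# (softmax scaling by 1/sqrt(head_dim) omitted here for brevity)
S_int32 = Q_int8.astype(np.int32) @ K_int8.astype(np.int32).T
S_fp = S_int32.astype(np.float32) * (q_scale @ k_scale.T)
```

Per-token scales for Q and K keep the dequantization of the score matrix a cheap rank-1 rescaling, while a single scale for V keeps the PV accumulation simple; this is the design trade-off the bullet list above describes.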
-----
💡 Key Insights:
→ First attention operator with fully INT8 input
→ Token-level quantization framework adaptable to other formats like INT4
→ Significant performance boost on Ampere GPUs, which account for roughly 20% of supercomputer compute power
→ Efficient scaling and dequantization within the online softmax workflow (see the sketch after this list)
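As a rough illustration of how dequantization can slot into FlashAttention's online softmax recurrence, here is a block-wise NumPy sketch that reuses the hypothetical quantization helpers above. The paper implements this as a fused INT8 GPU kernel; this only mirrors the math, not the kernel-level tiling or scheduling:

```python
import numpy as np

def int8_flash_attention_sketch(Q_int8, q_scale, K_int8, k_scale, V_int8, v_scale, block=4):
    """Illustrative online-softmax loop over K/V blocks with INT8 inputs (readability sketch)."""
    n, d = Q_int8.shape
    out = np.zeros((n, V_int8.shape[1]), dtype=np.float32)
    running_max = np.full((n, 1), -np.inf, dtype=np.float32)
    running_sum = np.zeros((n, 1), dtype=np.float32)

    for start in range(0, n, block):
        Kb = K_int8[start:start + block]
        Vb = V_int8[start:start + block].astype(np.float32) * v_scale  # tensor-level dequant of V
        kb_scale = k_scale[start:start + block]

        # INT8 GEMM with INT32 accumulation, then dequantize with per-token scales
        S = (Q_int8.astype(np.int32) @ Kb.astype(np.int32).T).astype(np.float32)
        S *= (q_scale @ kb_scale.T) / np.sqrt(d)

        # Standard FlashAttention online-softmax update (running max and sum)
        new_max = np.maximum(running_max, S.max(axis=-1, keepdims=True))
        correction = np.exp(running_max - new_max)
        P = np.exp(S - new_max)
        running_sum = running_sum * correction + P.sum(axis=-1, keepdims=True)
        out = out * correction + P @ Vb
        running_max = new_max

    return out / running_sum
```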
-----
📊 Results:
→ 72% faster inference speed vs FlashAttention-FP16
→ 82% smaller quantization error vs FlashAttention-FP8
→ Speed improvements: 31%, 52%, 66%, 72%, 73% for sequence lengths 1k, 2k, 4k, 8k, 16k
→ Up to 5.6x smaller Mean Relative Error under uniformly distributed activations